Abstract

In this dissertation we investigate the problem of reasoning over evolving structures
which describe the dependence among multiple, possibly vector-valued, time-series.
Such problems arise naturally in a variety of settings. Consider the problem of object
interaction analysis. Given tracks of multiple moving objects, one may wish to describe
if and how these objects are interacting over time. Alternatively, consider a scenario in
which one observes multiple video streams representing participants in a conversation.
Given a single audio stream, one may wish to determine with which video stream the
audio stream is associated as a means of indicating who is speaking at any point in
time. Both of these problems can be cast as inference over dependence structures.

In  the  absence  of  training  data,  such  reasoning  is  challenging  for  several  reasons. 
If  one  is  solely  interested  in  the  structure  of  dependence  as  described  by  a  graphical 
model,  there  is  the  question  of  how  to  account  for  unknown  parameters.  Additionally, 
the  set  of  possible  structures  is  generally  super-exponential  in  the  number  of  time  series. 
Furthermore,  if  one  wishes  to  reason  about  structure  which  varies  over  time,  the  number 
of  structural  sequences  grows  exponentially  with  the  length  of  time  being  analyzed. 

We present tractable methods for reasoning in such scenarios. We consider two
approaches for reasoning over structure while treating the unknown parameters as nuisance
variables. First, we develop a generalized likelihood approach in which point estimates
of parameters are used in place of the unknown quantities. We explore this approach
in scenarios in which one considers a small enumerated set of specified structures.
Second, we develop a Bayesian approach and present a conjugate prior on the parameters
and structure of a model describing the dependence among time-series. This allows
for Bayesian reasoning over structure while integrating over parameters. The modular
nature of the prior we define allows one to reason over a super-exponential number of
structures in exponential time in general. Furthermore, by imposing simple local or
global structural constraints we show that one can reduce the exponential-time
complexity to polynomial-time complexity while still reasoning over a super-exponential
number of candidate structures.

We cast the problem of reasoning over temporally evolving structures as inference
over a latent state sequence which indexes structure over time in a dynamic Bayesian
network. This model allows one to utilize standard algorithms such as Expectation
Maximization, Viterbi decoding, forward-backward messaging and Gibbs sampling in
order to efficiently reason over an exponential number of structural sequences. We
demonstrate the utility of our methodology on two tasks: audio-visual association and
moving object interaction analysis. We achieve state-of-the-art performance on a
standard audio-visual dataset and show how our model allows one to tractably make exact
probabilistic statements about interactions among multiple moving objects.
Chapter 1


Introduction 


In  this  dissertation  we  investigate  the  problem  of  analyzing  the  statistical  dependence 
among  multiple  time-series.  The  problem  is  cast  in  terms  of  inference  over  structure 
used  by  a  probabilistic  model  describing  the  evolution  of  these  time-series,  such  as  an 
undirected  graphical  model  or  directed  Bayesian  Network.  We  are  primarily  concerned 
with  the  structure  of  dependence  described  by  the  presence  or  absence  of  edges  in  a 
graphical  model  and  as  such  we  treat  model  parameters  as  nuisance  variables.  Building 
upon  the  large  body  of  work  on  structure  learning  and  dynamic  Bayesian  networks  we 
present  a  model  which  yields  tractable  inference  of  statistical  dependence  relationships 
which  change  over  time.  We  demonstrate  the  utility  of  our  method  on  audio-visual 
association  and  object-interaction  analysis  tasks. 

■  1.1  Motivation 

When designing and building intelligent systems it would be beneficial to give them the
ability to combine multiple sources of sensory information for the purpose of interpreting
and understanding the environment in which they operate. Fusion of multiple sources
of disparate information enables systems to operate more robustly and naturally. One
inspirational example of such a system is the human being. Our five basic sensing
modalities of sight, hearing, touch, taste and smell are combined to provide a view of
our environment that is richer than using any single sense in isolation. In addition to
being inherently multi-modal, human perception takes advantage of multiple sources
of information within a single modality. We have two eyes, two ears and multiple
pathways for our sense of touch. Moving beyond the act of sensing, the human brain is
highly effective at extracting, combining and interpreting higher-level information from
these sources. For example, using our visual input we are able to track multiple moving
objects and combine this information with what we hear in order to help identify the
source of a sound.

There are two main challenges when designing systems that fuse information from
multiple time-series. The first challenge is choosing how to represent the incoming data.
The high-dimensional nature of each data source makes processing the raw sensory
input computationally burdensome. Extracting features from each data source can
alleviate this problem by eliminating information that is irrelevant or inherently noisy.
For example, computation could be reduced if one were able to extract the location of
multiple objects in a scene to analyze their behaviors rather than processing the entire
visual scene as a whole. Even if dimensionality is low or features are extracted, the
important question of determining which features are informative for the specific task
remains. There has been considerable work on the task of general feature selection
[41, 8, 53, 55] and on learning informative representations across multiple data sources
[29, 84, 30]. This dissertation will assume the challenge of representation has been
addressed by one of these existing approaches or by carefully hand-picking features
which are sufficiently informative for the application of interest.

The second challenge is that of integration. Once features are extracted, there
is a question of how information obtained from them can be effectively combined.
Simple strategies can be devised assuming the data sources are independent of each other.
While computationally simple, such strategies neglect to take advantage of any shared
information among the inputs. The opposite extreme is to consider all possible
relationships among the inputs. This comes at the cost of a more complex model for integration.
Thus, a key task for fusing information among multiple data sources is to identify the
relationships among them. If one knew the dependence among the data sources,
this knowledge could be exploited to perform efficient integration. This is the main
focus of this dissertation: developing techniques for identifying the relationships among
the multiple sources of information. In many tasks, identifying these relationships is the
main result one is interested in.

Consider  a  scene  in  which  there  are  several  individuals,  each  of  whom  may  be 
speaking  at  any  given  moment.  Assume  a  recording  of  the  scene  produces  a  single 
audio  stream  in  addition  to  a  wide  angle  video  stream.  For  each  individual  in  the  scene 
a  separate  video  stream  representing  a  region  of  the  input  video  can  be  extracted.  Given 
this  data  over  a  long  period  of  time,  one  useful  task  is  to  determine  who,  if  anyone,  is 
speaking  at  each  point  in  time.  Humans  are  very  effective  at  this  task.  The  solution  to 
this audio-visual association problem has wide applicability to tasks such as automatic
meeting  transcription,  social  interaction  analysis,  and  control  of  human-computer  dialog 
systems.  The  semantic  label  identifying  who  is  speaking  can  be  related  to  associating 
the  audio  stream  with  many,  one  or  none  of  the  video  streams.  For  example,  the 
identification  that  “Victoria  is  speaking”  indicates  that  the  video  stream  for  Victoria  is 
associated  with  the  audio  stream. 

Next,  consider  a  scene  with  multiple  moving  objects.  For  example,  think  about  a 
basketball  game  in  which  the  moving  objects  are  the  players  on  each  team  and  the  ball. 
Given  measurements  of  the  position  of  these  objects  over  a  long  duration  of  time,  a 
question  one  may  ask  is:  Which,  if  any,  of  these  objects  are  interacting  at  each  point 
in  time?  The  answer  to  this  question  is  useful  for  a  variety  of  applications  including 
anomalous  event  detection  and  automatic  scene  summarization.  In  this  specific  example 
of  a  basketball  game,  understanding  the  interactions  among  players  can  help  one  identify 
a  particular  play,  which  team  is  on  offense,  and  who  is  covering  whom.  This  is  a  common 
and  natural  task  regularly  performed  by  humans.  Heider  and  Simmel  [45]  note  that 
when  presented  a  cartoon  of  simple  shapes  moving  on  a  screen,  one  will  tend  to  describe 
their behavior with terms such as “chasing”, “following” or being independent of one
another.  These  semantic  labels  describe  the  interaction  between  objects  and  allow 
humans  to  provide  a  compact  description  of  what  they  are  observing. 

Underlying both of these problems is the task of identifying associations or
interactions among the observed time-series. The words association and interaction both
describe statistical dependence. When we say the audio stream is associated with the video
stream for Victoria, we are implying the two time-series share information and should
not be modeled as being independent of one another. Similarly, when we claim two
objects are interacting we are implying some statistical dependence between them. For
example, a “following” interaction implies a causal dependence between the leader’s
current position and the follower’s future position.

Note that with each of these dependence relationships there is the question of
identifying the nature of that dependence. There are many ways in which time-series can
be dependent on each other. Associated with a particular dependence structure is
a set of parameters describing that dependence. For example, these parameters will
describe the tone of Victoria’s voice in the audio stream and exactly how motion in the
video stream is associated with changes in the audio. In the object interaction analysis
scenario the parameters describing the causal dependence for a “following” behavior
will characterize how fast the objects are moving and how closely the leader is being
followed. Our primary interest is in identifying the presence or absence of dependence
rather than accurately characterizing the nature of that dependence described by these
parameters.

■  1.2  Objective 

We  develop  a  general  framework  for  the  task  of  dynamic  dependence  analysis.  The 
primary  input  to  a  system  performing  dynamic  dependence  analysis  is  a  finite  set  of 
discrete-time  vector-valued  time-series  jointly  sampled  over  a  fixed  period  of  time.  The 
output  of  the  system  is  a  description  of  the  statistical  dependence  among  the  time-series 
at  each  point  in  time.  This  description  may  be  in  the  form  of  a  point  estimate  of  or 
distribution  over  dependence  structure.  The  techniques  presented  in  this  dissertation 
have  additional  inputs  such  as: 

•  The maximum number of distinct dependence relationships among which the system
should model the time-series as switching over time.

•  A specified class of static dependence models that will be used to describe a fixed
dependence among the time-series.

•  A prior on the dependence structures of interest and an optional prior on
parameters describing the nature of dependence.

■  1.3  Our  Approach 

Past approaches to the problem of audio-visual association have either been based on
basic measures of dependence over fixed windows of time [46, 84, 30, 68] or on
incorporating training data to help classify correct speaker association [68, 72]. Similarly, past
approaches for object interaction analysis have relied on measuring statistical
dependence among small groups of objects assuming the interaction is not changing [88] or on
trained classifiers to detect previously observed interacting behaviors [66, 49, 48, 69].

Building upon previous work we wish to address the issues that arise when
performing dependence analysis on data in which the structure may change over time.
Furthermore we wish to do so in the absence of large training datasets. Approaches
that rely on general measures of statistical dependence are powerful in that they do
not rely on any application-specific knowledge and can be adapted easily to many other
domains. Previous work has primarily focused on the use of windowed approaches
[68, 72]. That is, given a window of time they measure dependence assuming it is
static. By sliding this window over time such approaches attempt to capture changing
dependence. When using such an approach the window size used must be long enough
to get an accurate measure of dependence. At the same time it must be short enough
so as not to violate the assumption of a static dependence relationship being active within
the window. That is, there is a tradeoff between window size and how fast dependence
structures can change. Approaches that use training data have the advantage of being
specialized to the domain of interest and may be able to overcome some of these issues
by only needing a few samples to identify a particular structure. However,
domain-specific training data is not always available.
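
To make the windowed strategy concrete, the following is a minimal sketch and not the method of the cited work. It assumes scalar time-series and uses the absolute Pearson correlation as a stand-in dependence measure; the function names `pearson` and `windowed_dependence`, the window length, and the synthetic data are all illustrative assumptions.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant window carries no usable correlation
    return cov / math.sqrt(var_a * var_b)

def windowed_dependence(x, y, window, step=1):
    """Return [(start, |corr|)], treating dependence as static inside each window."""
    return [
        (start, abs(pearson(x[start:start + window], y[start:start + window])))
        for start in range(0, len(x) - window + 1, step)
    ]

# Synthetic data (assumed for illustration): y tracks x for the first 100 samples only.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [x[t] + 0.1 * rng.gauss(0.0, 1.0) if t < 100 else rng.gauss(0.0, 1.0)
     for t in range(200)]

scores = windowed_dependence(x, y, window=50, step=25)
for start, score in scores:
    print(f"window starting at {start:3d}: |corr| = {score:.2f}")
```

Windows that straddle the change at sample 100 mix two regimes, while shortening the window makes each estimate noisier. This is the tradeoff between window size and the rate of structural change described above.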

Our  core  idea  is  that  rather  than  treating  the  problem  as  a  series  of  windowed  tasks 
of  measuring  static  dependence,  we  cast  the  problem  of  dynamic  dependence  analysis 
in  terms  of  inference  on  a  dynamic  dependence  model.  A  dynamic  dependence  model 
explicitly  describes  dependence  structure  among  time-series  and  how  this  dependence 
evolves  over  time.  Using  this  class  of  models  allows  one  to  incorporate  knowledge  over 
all  time  when  making  a  local  decision  about  dependence  relationships.  Additionally,  it 
allows  one  to  take  advantage  of  repeated  dependence  relationships  from  different  points 
in  time.  We  demonstrate  the  advantage  of  this  approach  over  windowed  analysis  both 
theoretically  and  empirically. 

■  1.4  Key  Challenges  and  Contributions 

This  dissertation  addresses  a  set  of  key  challenges  and  makes  the  following  contributions: 

How does one map the associations in the audio-visual task and the
interactions in the object interaction analysis task to a particular
statistical dependence structure?

We introduce two general static dependence models: a static factorization model
(FactM) and a temporal interaction model (TIM). The FactM describes multiple
time-series as sets of independent groups. That is, it explicitly models how the
distribution on time-series factorizes. In the audio-visual association task,
reasoning over associations between the audio and video streams is related to reasoning
over factorizations.

The TIM allows one to provide more details about the dependence among
time-series. The model takes the form of a directed Bayesian network describing the
causal dependence of the current value of a single time-series on a
finite set of past values of other associated time-series. We treat the problem of
identifying interactions as one of identifying the details of these causal relationships.

How  many  possible  structures  are  there  and  is  it  tractable  to  consider 
all  of  them? 

The number of possible dependence structures among N time-series is
super-exponential in N, $O(N^N)$. For many applications, such as the audio-visual
association task, there is a small number of specific structures one is interested in
identifying and thus inference is generally tractable. However, in other
applications, one may want to consider all structures or a significantly large subset that
grows with N. We show that a super-exponential number of structures can be
reasoned over in exponential time in general by exploiting the temporal causality
assumptions in our TIM. We further show that by imposing some simple
restrictions on the structures considered one can reason over a still super-exponential
number of them in polynomial time.

How  does  one  incorporate  prior  knowledge  about  dependence? 

In some applications we may have some prior knowledge about which dependence
structures are more likely than others. When considering a tractable set of
structures a simple discrete prior distribution can be placed over them. However, this
approach becomes intractable as the number of structures considered increases.
Building upon the type of priors used for general structure learning [62, 54, 33]
we present a conjugate prior on structures for a TIM which allows for efficient
posterior updates.

When analyzing a large number of time-series, trying to uncover a single “true
structure” active at any point in time is often less desirable than a full
characterization of uncertainty. Using this class of priors we obtain a distribution over
structure rather than a point estimate. In addition, the exact posterior marginal
distributions and expectations can be obtained tractably. This allows for a
detailed characterization of uncertainty in the statistical dependence relationships
among time-series.
How does one separate the task of identifying dependence structure
from that of learning the associated parameters?

An issue which arises when inferring dependence relationships is how to
deal with unknown parameters which describe the nature of dependence. In this
dissertation we will explore two approaches. We show how a maximum likelihood
approach can be used to obtain a point estimate of the parameters from the
data being analyzed. We show theoretically and empirically how this approach
causes problems for windowed dependence analysis techniques while it can help
our dynamic dependence approach. Alternatively, we present a conjugate prior
over parameters of a TIM and show how to tractably perform exact Bayesian
inference integrating over all possible parameters.

How  does  one  model  dependence  that  may  change  over  time? 

Building on the large body of work on dynamic Bayesian networks (DBNs) and
hidden Markov models (HMMs) we introduce a dynamic dependence model which
contains a latent state variable that indexes structure. In this dissertation we
assume the number of possible structures is known a priori. That is, the number of
possible latent state values is assumed to be known. A simple Markov dynamic is
placed on this latent variable to model how structure evolves over time. Adopting
such a model allows one to infer changes in structure via the use of standard
forward-backward message passing algorithms. This allows one to reason over
an exponential number of possible sequences of dependence relationships in linear
time.
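
The linear-time claim can be illustrated with the standard forward recursion for a hidden Markov model. The sketch below is a generic textbook recursion, not the specific model developed in later chapters. It assumes K latent states, each standing for one candidate structure, and takes the per-time likelihoods as given numbers; the names `forward_likelihood` and `brute_force_likelihood` and all numeric values are hypothetical. The recursion costs O(T K^2), yet returns the same total as summing over all K^T state sequences explicitly.

```python
from itertools import product

def forward_likelihood(init, trans, lik):
    """Sum over all K**T latent sequences using O(T * K**2) operations.

    init[k]     : p(z_1 = k)
    trans[j][k] : p(z_t = k | z_{t-1} = j)
    lik[t][k]   : p(data at time t | z_t = k), assumed precomputed
    """
    K = len(init)
    alpha = [init[k] * lik[0][k] for k in range(K)]
    for t in range(1, len(lik)):
        alpha = [
            lik[t][k] * sum(alpha[j] * trans[j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)

def brute_force_likelihood(init, trans, lik):
    """Same quantity by explicit enumeration of every state sequence."""
    K, T = len(init), len(lik)
    total = 0.0
    for seq in product(range(K), repeat=T):
        p = init[seq[0]] * lik[0][seq[0]]
        for t in range(1, T):
            p *= trans[seq[t - 1]][seq[t]] * lik[t][seq[t]]
        total += p
    return total

# Hypothetical numbers: two candidate structures observed over six time steps.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
lik = [[0.8, 0.3], [0.7, 0.4], [0.2, 0.9], [0.1, 0.6], [0.5, 0.5], [0.9, 0.2]]

fast = forward_likelihood(init, trans, lik)
slow = brute_force_likelihood(init, trans, lik)
print(abs(fast - slow) < 1e-12)
```

The enumeration visits 2^6 = 64 sequences here and becomes infeasible quickly as T grows, while the recursion scales linearly in T.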

Additionally, we formulate both audio-visual association and object interaction
analysis tasks as special cases of dynamic dependence analysis. We show how state-of-the-art
performance on a standard dataset for audio-visual speaker association can be achieved
with our method. We demonstrate the utility of our approach in analyzing the
interactions among multiple moving players participating in a variety of games.

■  1.5  Organization  of  this  Dissertation 

We begin in Chapter 2 with a review of statistical dependence, graphical models,
conjugate priors and time-series models. This will give the background needed for the
rest of the dissertation. Chapter 3 introduces our two static dependence models which
assume a fixed dependence structure over all time. We discuss conjugate priors over
structure and detail inference on such models in both a classical maximum likelihood
and Bayesian framework. Chapter 4 extends these models to become dynamic
dependence models by embedding them in a hidden Markov model. We show theoretically how
such models have a distinct advantage over windowed approaches for dynamic
dependence analysis. Details are given for maximum likelihood inference using Expectation
Maximization and Bayesian inference using a Gibbs sampling approach. Chapters 5
and 6 present experiments using our models in audio-visual association and object
interaction analysis tasks respectively. We conclude with a summary and discussion of
future work in Chapter 7.


Chapter  2 


Background 


In the previous chapter we introduced the general problem of analyzing dependence
relationships among multiple time-series. In this chapter we give a brief overview of
the basic technical background the rest of the dissertation relies on. We use statistical
models to describe time-series and the relationships among them. That is, we describe
time-series in terms of random variables and their associated probability distributions.
We assume the reader has a basic understanding of probability theory, which can be found
in the introductory chapters of a wide variety of standard textbooks (cf. [5, 70]).

We  begin  by  defining  statistical  dependence.  This  is  followed  by  a  review  of  graphical 
models  and  how  they  can  be  used  to  encode  the  structure  of  dependence  relationships 
among  a  set  of  random  variables.  Next,  in  Section  2.3,  we  discuss  time-series  in  terms 
of  discrete-time  stochastic  processes  and  present  the  use  of  a  simple  Markov  model  for 
describing  temporal  dependence.  In  Section  2.4  we  overview  the  standard  parametric 
models  which  will  be  used  in  this  dissertation  along  with  a  discussion  of  conjugacy, 
inference  and  learning.  Lastly,  we  briefly  present  select  topics  from  information  theory 
and  their  relation  to  the  problem  of  analyzing  statistical  dependence. 

The intent of this chapter is to provide a brief overview of the selected material. We
refer the reader to standard textbooks for a more rigorous treatment and proofs (cf.
[5, 70, 21, 73]).

■  2.1  Statistical  Dependence 

We start by defining statistical dependence. Intuitively, two random variables are
dependent if knowledge of one tells you something about the other. To formalize this
abstract concept one can more concretely define statistical independence. Two random
variables, x and y, are statistically independent if and only if their joint distribution is a
product of their marginal distributions:

$$p(x, y) = p(x)\, p(y). \tag{2.1}$$

The notation $x \perp\!\!\!\perp y$ is used to denote statistical independence. Using the fact that
$p(x, y) = p(x)\,p(y \mid x) = p(y)\,p(x \mid y)$ this requirement is equivalent to saying that the
conditional distribution of one random variable given the other is not a function of what
is being conditioned on:

$$p(x \mid y) = p(x) \quad \text{and} \tag{2.2}$$

$$p(y \mid x) = p(y). \tag{2.3}$$

These definitions can be easily extended to more than two random variables. The joint
distribution of a collection of independent random variables factorizes as a product of
their marginals and all conditional distributions are independent of the random variables
conditioned on.

A  simple  example  of  statistical  independence  is  when  x  and  y  are  the  results  of  coin 
tosses  of  two  separate  coins.  Each  coin  may  have  its  own  bias  /  probability  of  heads 
versus  tails,  but  if  they  are  two  physically  separate  coins  tossed  in  isolation,  knowledge 
of  whether  x  is  heads  or  tails  tells  you  nothing  about  the  outcome  y. 

Statistical dependence is simply defined to be the absence of statistical
independence. That is, two random variables are dependent if their joint distribution is not
the product of their marginal distributions. Going back to a simple coin example, let
x be the result of a fair coin toss such that p(x = heads) = p(x = tails) = .5. Given
x, imagine placing a small weight on a second fair coin so that it is biased to be more
likely to land on the same side as x. Let y be the result of flipping this biased coin. To
be concrete, assume this bias favors x with probability .9: p(y = heads | x = heads) =
p(y = tails | x = tails) = .9 and p(y = tails | x = heads) = p(y = heads | x = tails) = .1.
Here, x and y are statistically dependent. Our description of the experiment clearly
indicates how knowledge of x tells us something about y. Using Bayes’ rule it is
easy to show that the reverse is also true and knowledge of y also tells us something
about x. Specifically, p(x = heads | y = heads) = .9 * .5 / (.9 * .5 + .1 * .5) = .9 while
p(x = heads | y = tails) = .1.
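
The coin example can be checked numerically. The following minimal sketch builds the joint distribution from the stated probabilities, tests the factorization of Equation 2.1, and applies Bayes' rule; the helper names `is_independent` and `posterior_x_given_y` are illustrative and not part of the original text.

```python
from itertools import product

outcomes = ("heads", "tails")

p_x = {"heads": 0.5, "tails": 0.5}      # fair first coin
p_y_given_x = {                          # second coin favors the outcome of x
    "heads": {"heads": 0.9, "tails": 0.1},
    "tails": {"heads": 0.1, "tails": 0.9},
}

# Joint p(x, y) = p(x) p(y | x).
joint = {(x, y): p_x[x] * p_y_given_x[x][y] for x, y in product(outcomes, outcomes)}

# Marginal p(y), obtained by summing the joint over x.
p_y = {y: sum(joint[(x, y)] for x in outcomes) for y in outcomes}

def is_independent(joint, p_x, p_y, tol=1e-12):
    """Check Equation 2.1: p(x, y) == p(x) p(y) for every outcome pair."""
    return all(abs(joint[(x, y)] - p_x[x] * p_y[y]) <= tol for (x, y) in joint)

def posterior_x_given_y(x, y):
    """Bayes' rule: p(x | y) = p(x, y) / p(y)."""
    return joint[(x, y)] / p_y[y]

print(is_independent(joint, p_x, p_y))
print(round(posterior_x_given_y("heads", "heads"), 6))
print(round(posterior_x_given_y("heads", "tails"), 6))
```

The factorization check fails, confirming dependence, and the two posteriors agree with the values .9 and .1 derived above.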

Figure 2.1. Dependence: Undirected graphical models depicting independent, $p(x_1)\,p(x_2)$, and
dependent, $p(x_1, x_2)$, distributions for two random variables.

Another important concept is that of conditional independence. Two random
variables x and y are conditionally independent given another random variable z, $x \perp\!\!\!\perp y \mid z$,
if and only if:

$$p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \tag{2.4}$$

or equivalently

$$p(x \mid y, z) = p(x \mid z) \quad \text{and} \tag{2.5}$$

$$p(y \mid x, z) = p(y \mid z). \tag{2.6}$$

It  is  important  to  note  that  conditional  independence  does  not  imply  independence  and 
vice  versa.  We  will  give  some  specific  examples  in  the  following  section  when  discussing 
Bayesian  networks. 
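
Ahead of those examples, the claim can be verified on two small binary distributions. Both are assumptions introduced here for illustration: a "common cause" model in which x and y are independent noisy copies of a fair bit z, and an "XOR" model in which x and y are independent fair bits and z is their exclusive-or. The helper names are likewise hypothetical.

```python
from itertools import product

def marginal(joint, keep):
    """Sum a joint table {(x, y, z): prob} down to the index positions in `keep`."""
    out = {}
    for assignment, prob in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

def independent(joint, tol=1e-12):
    """Equation 2.1 for x and y after marginalizing out z."""
    pxy = marginal(joint, (0, 1))
    px, py = marginal(joint, (0,)), marginal(joint, (1,))
    return all(abs(pxy[(x, y)] - px[(x,)] * py[(y,)]) <= tol for (x, y) in pxy)

def conditionally_independent(joint, tol=1e-12):
    """Equation 2.4 multiplied through by p(z)^2: p(x,y,z) p(z) == p(x,z) p(y,z)."""
    pz = marginal(joint, (2,))
    pxz, pyz = marginal(joint, (0, 2)), marginal(joint, (1, 2))
    return all(
        abs(p * pz[(z,)] - pxz[(x, z)] * pyz[(y, z)]) <= tol
        for (x, y, z), p in joint.items()
    )

bits = (0, 1)

# z is a fair bit; x and y each copy z with probability 0.9, independently.
common_cause = {
    (x, y, z): 0.5 * (0.9 if x == z else 0.1) * (0.9 if y == z else 0.1)
    for x, y, z in product(bits, repeat=3)
}

# x and y are independent fair bits; z is their exclusive-or.
xor_gate = {
    (x, y, z): (0.25 if z == (x ^ y) else 0.0)
    for x, y, z in product(bits, repeat=3)
}

print(conditionally_independent(common_cause), independent(common_cause))
print(conditionally_independent(xor_gate), independent(xor_gate))
```

In the common-cause model, x and y are conditionally independent given z but marginally dependent. In the XOR model the situation is reversed.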

■  2.2  Graphical  Models 

In the previous section we discussed the link between statistical independence of
random variables and the structure of their joint distribution. Probabilistic graphical
models combine the syntax of graph theory with probabilistic semantics to describe
the dependence structure among a set of random variables. Graphs are used to
describe the overall joint probability of the random variables in terms of local functions
on subgraphs. This local decomposition allows for efficient reasoning over the random
variables represented. More importantly for this dissertation, graphical models provide
a convenient way to specify and understand conditional independence relationships in
terms of graph topology. For example, consider the two undirected graphical models
depicted in Figure 2.1. It is intuitive and simple to read dependence information from
these graphs.

In the following section we review basic graph theory to establish some notation.
We then discuss different classes of graphical models, each of which uses a different
graphical representation to describe conditional independence. Again, these sections
provide a quick overview and we refer the reader to standard textbooks for a more
detailed treatment [73].
■  2.2.1  Graphs

A graph $G = \{V, E\}$ is a set of vertices/nodes $V$ and a collection of edges $E$. Edges
$(i, j) \in E$ connect two different nodes $i, j \in V$. Edges can be either undirected or
directed. In undirected graphs, the edge $(i, j)$ is in $E$ if and only if $(j, i)$ is as well. A
vertex $j$ is a neighbor of vertex $i$ if $(i, j) \in E$ and the set of neighbors for vertex $i$ is simply
the collection of all such $j \in V$. A path is defined as a sequence of neighboring nodes.
If there is a path between all pairs of vertices the graph is said to be connected. For undirected
graphs a clique $C \subseteq V$ is a collection of vertices for which all pairs have an edge
between them. If the entire graph forms a clique the graph is said to be complete.
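
These definitions translate directly into a few set operations. The sketch below is one possible encoding, assuming an undirected graph is stored as a vertex set plus a symmetric set of ordered pairs; the function names and the example graph are illustrative choices.

```python
from itertools import combinations

def make_undirected(vertices, edges):
    """Return (V, E) with E symmetric: (i, j) in E if and only if (j, i) in E."""
    E = set()
    for i, j in edges:
        if i == j:
            raise ValueError("edges connect two different nodes")
        E.add((i, j))
        E.add((j, i))
    return set(vertices), E

def neighbors(i, E):
    """All j such that (i, j) is an edge."""
    return {j for (a, j) in E if a == i}

def is_connected(V, E):
    """True if a path exists between every pair of vertices (depth-first search)."""
    if not V:
        return True
    start = next(iter(V))
    seen, stack = {start}, [start]
    while stack:
        for j in neighbors(stack.pop(), E):
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen == V

def is_clique(C, E):
    """True if every pair of vertices in C has an edge between them."""
    return all((i, j) in E for i, j in combinations(C, 2))

def is_complete(V, E):
    """True if the entire vertex set forms a clique."""
    return is_clique(V, E)

# Example graph: a triangle {1, 2, 3} with vertex 4 attached to vertex 3.
V, E = make_undirected({1, 2, 3, 4}, [(1, 2), (2, 3), (1, 3), (3, 4)])
print(neighbors(3, E), is_connected(V, E), is_clique({1, 2, 3}, E), is_complete(V, E))
```

The example graph is connected and contains the clique {1, 2, 3}, but it is not complete because vertices 1 and 4 share no edge.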

In a directed graph, directed edges $(i,j)$ connect a parent node $i$ to its child $j$.
Throughout the dissertation we will use the notation $E$ to represent a set of directed
edges and will represent them as arrows in figures. For example, see Figure 2.2(b).
The set of all parents $pa(i, E)$ for a particular node $i$ is the set of all $j \in V$ such that
$(j,i) \in E$. For brevity we will often drop the $E$ and refer to this set as $pa(i)$.

A useful generalization of a graph is a hypergraph. A hypergraph $H = \{V, F\}$ is
composed of vertices $V$ and associated hyperedges $F$. Each hyperedge $f \in F$ is specified
by a non-empty subset of $V$. That is, each hyperedge can connect one or more
vertices. Two vertices are neighbors in this hypergraph if they belong to a common
hyperedge. We represent a hypergraph as a graph with extra square nodes for each
hyperedge. All vertices which belong to that hyperedge are connected to the associated
square node. This makes the graph bipartite in that edges only occur between vertices
and the square hyperedge nodes. See Figure 2.2(c) for an example.
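
The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the edge sets are read off the factorizations given for Figure 2.2 in this chapter, and the helper names (`neighbors`, `is_clique`, `parents`, `hyper_neighbors`) are hypothetical, not notation from the dissertation.

```python
from itertools import combinations

# Vertex set shared by the three example graphs of Figure 2.2.
V = {1, 2, 3, 4}
# Undirected edges, stored as unordered pairs so (i, j) and (j, i) coincide.
E_undirected = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}
# Directed edges stored as (parent, child).
E_directed = {(1, 3), (2, 3), (3, 4)}
# Hyperedges: each is a non-empty subset of V.
F = [{1, 2, 3}, {3, 4}]

def neighbors(i, edges):
    """All j such that the undirected edge (i, j) is present."""
    return {j for e in edges if i in e for j in e if j != i}

def is_clique(nodes, edges):
    """True if every pair of the given nodes is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(sorted(nodes), 2))

def parents(i, directed_edges):
    """pa(i): all j such that the directed edge (j, i) is present."""
    return {p for (p, c) in directed_edges if c == i}

def hyper_neighbors(i, hyperedges):
    """Vertices sharing at least one hyperedge with i."""
    return {j for f in hyperedges if i in f for j in f if j != i}
```

Storing undirected edges as `frozenset` pairs enforces the symmetry requirement for undirected graphs without keeping both orientations.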

■  2.2.2  Factor  Graphs 

Factor graphs are a class of graphical models which use a hypergraph $H$ to specify the
form of the joint probability distribution on a set of random variables [58]. Let $x_f$
represent the set of all random variables $\{x_i \mid i \in f\}$. Given a set of hyperedges $F$, a
factor graph represents the joint distribution $p(x_V)$ as

$$p(x_V) = \frac{1}{Z} \prod_{f \in F} \psi_f(x_f), \tag{2.7}$$

where each $\psi_f$ is a non-negative function of its arguments and $Z$, often referred to as
the partition function, guarantees proper normalization. That is,

$$Z = \sum_{x_V} \prod_{f \in F} \psi_f(x_f). \tag{2.8}$$

Figure 2.2. Equivalent Graphical Models: Three graphical models using different representations to
describe the joint distribution of four random variables. (a) A Markov random field, (b) a directed
Bayesian network, (c) a factor graph, (d) a second factor graph with the same neighbor relationships.

Figure 2.2(c) shows an example in which

$$p(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, \psi_{1,2,3}(x_1, x_2, x_3)\, \psi_{3,4}(x_3, x_4). \tag{2.9}$$

In the equation above, and in general, $\psi_f(x_f)$ does not correspond to the marginal
distribution $p(x_f)$. However, in this dissertation we will use factor graphs in situations
in which each random variable belongs to a single hyperedge and the $\psi_f$ functions
correspond to marginal distributions.

Conditional independence information can be read directly from a factor graph.
Any two random variables $x_i$ and $x_j$ are conditionally independent given a set $x_Z$ for
$Z \subset V$ if every path between $x_i$ and $x_j$ in the hypergraph passes through some vertex
$k \in Z$. Two random variables are marginally independent if there is no path between
them. In Figure 2.2(c), $x_1 \perp\!\!\!\perp x_4 \mid x_3$. Using this fact it is simple to see that $x_i$ is
independent of all other random variables given its neighbors in the factor graph.
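
The path-based rule above amounts to a reachability test. The following is a minimal sketch, assuming the hyperedges $\{1,2,3\}$ and $\{3,4\}$ of Figure 2.2(c); the function name `separated` is hypothetical. It runs a breadth-first search from $i$ that refuses to enter any vertex in the conditioning set, and reports separation if $j$ is never reached.

```python
from collections import deque

def separated(i, j, given, hyperedges):
    """True if every path from i to j passes through a vertex in `given`."""
    blocked = set(given)
    seen, queue = {i}, deque([i])
    while queue:
        u = queue.popleft()
        for f in hyperedges:
            if u not in f:
                continue
            # Every other vertex in a shared hyperedge is a neighbor of u.
            for w in f:
                if w not in seen and w not in blocked:
                    seen.add(w)
                    queue.append(w)
    return j not in seen

# Hyperedges of the factor graph in Figure 2.2(c).
F = [{1, 2, 3}, {3, 4}]
```

With this structure, `separated(1, 4, {3}, F)` holds while `separated(1, 4, set(), F)` does not, matching the statement that $x_1$ and $x_4$ are conditionally independent given $x_3$ but connected otherwise.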

■  2.2.3  Undirected  Graphical  Models 

Undirected graphical models, generally referred to as Markov Random Fields (MRFs),
encode conditional independence relationships in a graph $G = \{V, E\}$. Closely related to
factor graphs, MRFs represent the joint distribution as a product of potential functions
on cliques $C$:

$$p(x_V) = \frac{1}{Z} \prod_{C} \psi_C(x_C), \tag{2.10}$$

where, again, $Z$ is the normalization constant. MRFs have the same conditional
independence properties as factor graphs in that two random variables $x_i$ and $x_j$ are
conditionally independent given a set of random variables that must be passed through
for every path between $x_i$ and $x_j$.

Figure  2.2(a)  shows  an  MRF  with  the  same  conditional  dependence  relationships 
as  the  factor  graph  in  Figure  2.2(c).  In  fact,  in  general,  there  are  many  factor  graphs 
which  represent  the  same  conditional  independence  relationships  for  a  given  MRF. 
Figure  2.2(d)  shows  a  second  factor  graph  which  has  the  same  dependence  properties  as 
Figure  2.2(a).  Factor  graphs  have  more  flexibility  in  that  they  can  explicitly  specify  how 
potentials  on  cliques  factor,  further  constraining  the  space  of  possible  joint  distributions 
represented.  This  flexibility  comes  from  their  more  general  hypergraph  specification. 

■  2.2.4  Bayesian  Networks 

Bayesian networks are graphical models which can encode causality via a directed
representation. Given a directed graph $G = \{V, E\}$ a Bayesian network represents a joint
probability which factors as a product of conditional distributions:

$$p(x_V) = \prod_{i \in V} p(x_i \mid x_{pa(i)}). \tag{2.11}$$

If a node has no parents, $pa(i) = \emptyset$, then we define $p(x_i \mid x_\emptyset) = p(x_i)$. This factorization
is only valid when $E$ is acyclic. That is, it represents a graph in which there is no
directed path from one node back to itself. Figure 2.2(b) depicts a Bayesian network
with:

$$p(x_1, x_2, x_3, x_4) = p(x_1)\, p(x_2)\, p(x_3 \mid x_1, x_2)\, p(x_4 \mid x_3). \tag{2.12}$$

This  decomposition  exposes  an  underlying  generative  process. 

Reading  conditional  independence  from  a  Bayesian  network  is  more  complicated 
than  for  undirected  models  and  factor  graphs.  The  Bayes  Ball  algorithm  provides  a 
way  of  extracting  independence  relationships  from  Bayesian  network  structure  [79]. 
One useful property is that a random variable is independent of all others conditioned
on its parents, its children, and its children's parents. A Bayesian network can be
moralized into an MRF by connecting the parents of each node with an undirected
edge  and  replacing  all  directed  edges  with  undirected  ones.  Figure  2.2(a)  is  a  moralized 
version  of  Figure  2.2(b). 
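
Moralization as described above can be sketched as follows. This is an illustrative sketch only: the function name `moralize` is hypothetical, and the directed edges are those implied by the factorization in Equation 2.12.

```python
from itertools import combinations

def moralize(nodes, directed_edges):
    """Return the undirected edge set of the moralized graph.

    directed_edges holds (parent, child) pairs. Each directed edge is kept
    as an undirected one, and every pair of parents of a common child is
    connected ("married").
    """
    undirected = {frozenset(e) for e in directed_edges}
    for child in nodes:
        pa = sorted(p for (p, c) in directed_edges if c == child)
        undirected.update(frozenset(pair) for pair in combinations(pa, 2))
    return undirected

# Bayesian network of Figure 2.2(b): 1 -> 3, 2 -> 3, 3 -> 4.
moral_edges = moralize({1, 2, 3, 4}, {(1, 3), (2, 3), (3, 4)})
```

The only edge added for this network is the one between $x_1$ and $x_2$, the two parents of $x_3$, which yields the clique $\{x_1, x_2, x_3\}$ of Figure 2.2(a).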

Bayesian networks can explicitly represent certain statistical dependence relationships
that factor graphs and MRFs cannot. For example, consider the V structure found
between $x_1$, $x_2$ and $x_3$ in Figure 2.2(b). A classic example with this dependence structure is a
scenario in which $x_1$ and $x_2$ represent the outcomes of two independent coin tosses. Let
$x_3$ be an indicator of whether $x_1$ and $x_2$ had the same outcome. Causally $x_1$ and $x_2$
are independent, $x_1 \perp\!\!\!\perp x_2$, but conditioned on knowing $x_3$ they become dependent.
Knowing whether the coin tosses were the same or not and the outcome of $x_1$ tells you
a lot about $x_2$ (in this example it determines $x_2$ exactly). In order to capture this relationship
in an MRF, the three random variables must form a clique (adding an edge between $x_1$ and $x_2$).
This is a case in which independence does not imply conditional independence. The reverse is
true for $x_1$ and $x_4$ in Figure 2.2(b). They are statistically dependent but are conditionally
independent given $x_3$, $x_1 \perp\!\!\!\perp x_4 \mid x_3$.
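
The coin-toss example can be checked by brute-force enumeration. The sketch below assumes two fair coins and builds the joint table for $(x_1, x_2, x_3)$; the helpers `marginal` and `independent` are hypothetical names introduced only for this illustration. Independence is tested through the equivalent identity $p(a,b,g)\,p(g) = p(a,g)\,p(b,g)$, which avoids dividing by zero.

```python
from itertools import product

# Joint distribution over (x1, x2, x3): two fair coins, x3 = 1 if they match.
joint = {}
for x1, x2 in product((0, 1), repeat=2):
    joint[(x1, x2, int(x1 == x2))] = 0.25

def marginal(joint, keep):
    """Sum the joint table down to the variable positions listed in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[k] for k in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def independent(joint, a, b, given=()):
    """Check whether variables a and b are independent given `given`."""
    given = tuple(given)
    p_abg = marginal(joint, (a, b) + given)
    p_ag = marginal(joint, (a,) + given)
    p_bg = marginal(joint, (b,) + given)
    p_g = marginal(joint, given)
    for va, vb in product((0, 1), repeat=2):
        for g, pg in p_g.items():
            lhs = p_abg.get((va, vb) + g, 0.0) * pg
            rhs = p_ag.get((va,) + g, 0.0) * p_bg.get((vb,) + g, 0.0)
            if abs(lhs - rhs) > 1e-12:
                return False
    return True
```

Here `independent(joint, 0, 1)` holds, while `independent(joint, 0, 1, given=(2,))` fails: conditioning on the common child $x_3$ couples the two tosses.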

■  2.3  Time-series 

So  far  in  this  chapter  we  have  been  talking  about  statistical  dependence  among,  and 
distributions for, general sets of random variables. In this dissertation we are interested
in the dependence among time-series. A time-series can be modeled as a discrete time
stochastic process whose value at time $t$ is represented by a random variable, $x_t$. A
stochastic process is fully characterized by all joint distributions of the process at different
points in time. A discrete stochastic process over a finite interval from 1 to $T$ is
completely specified in terms of its joint $p(x_1, x_2, \ldots, x_T)$. We will often use the notation
$x_{1:T}$ to denote such a sequence. If $T$ is not fixed, this distribution would have to be
specified  for  all  possible  T  one  expects  to  encounter.  Such  an  approach  is  not  tractable, 
and  thus  it  is  common  to  make  certain  assumptions  about  the  temporal  dependence. 

One  simple  model  for  time-series  is  to  consider  each  time  point  to  be  independent 
and  identically  distributed  (i.i.d.)  such  that: 


$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t). \tag{2.13}$$


Figure 2.3. Markov Models for Time-Series: Directed Bayesian network (DBN) (a) represents a first
order model and DBN (b) represents a second order model.

While easy to represent, this approach does not capture the fact that information from
the past can help predict future time-series values. A Markov model is another tractable
model which can capture some of this dependence. Consider the fact that the full joint
distribution can always be factorized as

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}). \tag{2.14}$$

An $r$-th order Markov model is obtained if one assumes that each conditional distribution
on the right-hand side depends only on the previous $r$ time points. A first order model is simply

$$p(x_{1:T}) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}) \tag{2.15}$$

and is depicted as a directed Bayesian network in Figure 2.3(a). A second order model
is shown in Figure 2.3(b). Note that a zero order model is simply i.i.d.
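
To make the difference between Equations 2.13 and 2.15 concrete, the following sketch evaluates both log-likelihoods for a short binary sequence. The sequence, the initial distribution and the two transition tables are made-up illustration values, and the function names are hypothetical.

```python
import math

def iid_log_likelihood(seq, pi):
    """Zero order model: sum_t log p(x_t)."""
    return sum(math.log(pi[x]) for x in seq)

def markov1_log_likelihood(seq, pi0, trans):
    """First order model: log p(x_1) + sum_{t=2}^T log p(x_t | x_{t-1}).

    trans[a][b] is the probability of moving from state a to state b.
    """
    ll = math.log(pi0[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll

seq = [0, 0, 1, 1, 1, 0]            # illustrative data
pi0 = [0.5, 0.5]                    # distribution of the first value
sticky = [[0.9, 0.1], [0.1, 0.9]]   # states tend to persist
flat = [[0.5, 0.5], [0.5, 0.5]]     # transitions ignore the past

ll_sticky = markov1_log_likelihood(seq, pi0, sticky)
ll_flat = markov1_log_likelihood(seq, pi0, flat)
```

When the transition table ignores the previous state (`flat`), the first order model collapses to the i.i.d. model, which is the sense in which a zero order model is a special case.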

We  will  use  this  class  of  models  for  our  time-series  throughout  the  dissertation. 
Note, here we are describing a single time-series $x_{1:T}$. In Chapter 3 we will discuss
models  for  jointly  describing  multiple  time-series  in  which  the  temporal  dependence  is 
fixed  but  the  dependence  across  time-series  is  unknown. 


■  2.4  Parameterization,  Learning,  Inference  and  Evidence 

In  the  previous  sections  we  discussed  statistical  dependence  and  how  it  can  be  encoded 
by  a  graphical  model.  We  discussed  how  these  models  specify  the  form  of  the  joint 
distribution in terms of local conditional probability distributions or potential functions.
Implicitly each of these local distributions and potential functions has a set of
parameters  associated  with  it.  The  parameters  fully  specify  its  form.  So  far,  we  have 
been  hiding  these  parameters  in  the  notation. 

Figure 2.4. Deterministic vs Random Parameters: Two graphical models depicting two different
treatments of parameters, $\Theta$. (a) treats $\Theta$ as an unknown deterministic quantity, (b) treats $\Theta$ as an
unobserved random variable whose prior distribution is specified by known hyperparameters $\Upsilon$.

Here we will make this parameterization explicit and use the notation $p(x \mid \Theta)$ rather
than $p(x)$. Similarly for conditional distributions we will use $p(x \mid y, \Theta_{x|y})$ rather than
$p(x \mid y)$. For example, a more explicit representation for a directed Bayesian network
on a set of random variables $x_V$ is

$$p(x_V \mid \Theta) = \prod_{i \in V} p(x_i \mid x_{pa(i)}, \Theta_{i|pa(i)}), \tag{2.16}$$

where $\Theta$ contains $\Theta_{i|pa(i)}$ for all $i \in V$. Note that the actual parameterization is
still implicit in this notation. Only the parameter values are explicitly denoted by
$\Theta$. For example, one must first say $p(x \mid \Theta)$ is a Gaussian distribution before one can
identify that the parameters $\Theta$ describe the value of the mean and covariance for that
distribution.

It is also important to point out that, while this notation explicitly specifies parameters
with $\Theta$, the structure described by $E$ in Equation 2.16 is implicit. To be fully
explicit one should use the notation $p(x_V \mid E, \Theta)$ instead. In future chapters of this
dissertation we use this notation and focus on reasoning about this dependence structure.
However, in this section, for simplicity, we will leave the structure implicit or consider
it part of $\Theta$.

Given  a  parameterization  and  parameter  values  one  can  calculate  the  probability  of 
an observation of $x$. However, throughout this dissertation we will only be given observations
of $x$ without knowledge of the true underlying parameters. We can deal with this
situation  in  one  of  three  possible  ways. 

First, treating the parameters $\Theta$ as deterministic unknown quantities, we can attempt
to learn them from the observed data. A common approach to learning is to
maximize the likelihood of $\Theta$. That is, if we denote $\mathcal{D} = \{D_1, D_2, \ldots, D_N\}$ as $N$ independent
observations of the random variable $x$ we wish to find:

\begin{align}
\hat{\Theta} &= \arg\max_{\Theta}\; p(\mathcal{D} \mid \Theta) \tag{2.17} \\
&= \arg\max_{\Theta} \prod_{n=1}^{N} p(x = D_n \mid \Theta) \tag{2.18}
\end{align}

We can also treat $\Theta$ as just another random variable. A prior, $p_0(\Theta \mid \Upsilon)$, can be placed
on $\Theta$ to capture our prior belief on the value of $\Theta$. Here, $\Upsilon$ is a set of hyperparameters
used to specify this prior belief. This is depicted in the graphical model shown in
Figure 2.4(b). In this figure rounded rectangle nodes are used to denote deterministic
quantities and shaded circular nodes indicate what is observed in $\mathcal{D}$.

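Given observed data, $\mathcal{D}$, one can then calculate the posterior on $\Theta$. A second
approach to deal with the unknown parameters is to find the maximum a posteriori
(MAP) $\Theta$ using this posterior:

\begin{align}
\Theta^* &= \arg\max_{\Theta}\; p(\Theta \mid \mathcal{D}) \tag{2.19} \\
&= \arg\max_{\Theta}\; p(\mathcal{D} \mid \Theta)\, p_0(\Theta \mid \Upsilon) \tag{2.20}
\end{align}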

Calculating the posterior and/or calculating the MAP estimate can be difficult depending
on the form of the chosen prior $p_0(\Theta \mid \Upsilon)$. However, the optimization and general
calculation of the posterior is simplified greatly when a conjugate prior exists and is
used. A family of priors specified by $p_0(\Theta \mid \Upsilon)$ is conjugate if the posterior $p(\Theta \mid \mathcal{D}, \Upsilon)$
remains in the family. That is, it is conjugate if $p(\Theta \mid \mathcal{D}, \Upsilon) = p_0(\Theta \mid \bar{\Upsilon})$ where $\bar{\Upsilon}$ is a
function of $\mathcal{D}$ and the original $\Upsilon$.

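A third approach is to marginalize over the parameters $\Theta$ and use the evidence when
calculating probabilities:

$$p(\mathcal{D}) = \int_{\Theta} p(\mathcal{D} \mid \Theta)\, p_0(\Theta \mid \Upsilon)\, d\Theta \tag{2.21}$$

Integrating over the space of all parameters is difficult in general. However, again,
having a conjugate prior allows for tractable computation since the evidence can be
written as:

$$p(\mathcal{D}) = \frac{p(\mathcal{D} \mid \Theta)\, p_0(\Theta \mid \Upsilon)}{p(\Theta \mid \mathcal{D}, \Upsilon)} = \frac{p(\mathcal{D} \mid \Theta)\, p_0(\Theta \mid \Upsilon)}{p_0(\Theta \mid \bar{\Upsilon})} \tag{2.22}$$

This identity is Bayes' rule rearranged, so it holds for any value of $\Theta$ at which the
posterior density is nonzero.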


In the following sections we will review specific parameterized families of distributions,
ML estimates of their parameters, corresponding conjugate priors and detail how
to calculate evidence. Each of the distributions presented belongs to the exponential
family of distributions [3].

■  2.4.1  Discrete  Distribution 

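Let $x$ be a discrete random variable taking on one of $K$ possible values in the set
$\{1, \ldots, K\}$. A discrete distribution is a probability mass function with probability $\pi_k$
that $x$ takes on value $k$:

\begin{align}
p(x \mid \Theta) &= \text{Discrete}(x; \pi) \tag{2.23} \\
&= \pi_x \tag{2.24} \\
&= \prod_{k=1}^{K} \pi_k^{\delta(x-k)} \tag{2.25}
\end{align}

where $\Theta = \{\pi_1, \ldots, \pi_K\}$ and $\delta(u)$ is the discrete delta function taking a value of 1 if
$u = 0$ and 0 otherwise. Given $N$ samples of $x$ as $\mathcal{D} = \{D_1, \ldots, D_N\}$, the ML estimate
of $\Theta$ is simply:

$$\hat{\Theta} = \arg\max_{\Theta}\; p(\mathcal{D} \mid \Theta) = \{\hat{\pi}_1, \ldots, \hat{\pi}_K\} = \left\{\frac{n_1}{N}, \ldots, \frac{n_K}{N}\right\} \tag{2.26}$$

where $n_k$ is a count of the number of data points which had value $k$, $n_k = |\{i \mid D_i = k\}|$.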


Dirichlet  Distribution 

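The conjugate prior for a discrete distribution is the Dirichlet distribution (see footnote 1). Given
hyperparameters $\Upsilon = \alpha = \{\alpha_1, \ldots, \alpha_K\}$ the Dirichlet distribution has the form:

\begin{align}
p_0(\Theta \mid \Upsilon) = p(\pi \mid \alpha) &= \text{Dir}(\pi_1, \ldots, \pi_K; \alpha_1, \ldots, \alpha_K) \tag{2.27} \\
&= \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \pi_k^{\alpha_k - 1} \tag{2.28}
\end{align}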

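Footnote 1: The Dirichlet distribution is normally said to be conjugate for the multinomial distribution. Given
$N$ discrete random variables each taking $K$ values, the multinomial is a distribution on the counts $n_k$
rather than the particular sequence. The Dirichlet is also conjugate for the simple discrete distribution,
used here, which models the sequence of $N$ independent observations rather than just the counts.

A uniform distribution can be obtained by setting all $\alpha_k = 1$. It is simple to see why
this distribution is conjugate. The posterior on the parameters $\Theta$, given data, is:

\begin{align}
p(\Theta \mid \mathcal{D}, \Upsilon) &\propto p(\mathcal{D} \mid \Theta)\, p_0(\Theta \mid \Upsilon) \tag{2.29} \\
&\propto \prod_{k=1}^{K} \pi_k^{\alpha_k + n_k - 1} \tag{2.30} \\
&\propto \text{Dir}(\Theta; \alpha_1 + n_1, \ldots, \alpha_K + n_K) \tag{2.31}
\end{align}

or can equivalently be written as $p(\Theta \mid \mathcal{D}, \Upsilon) = p_0(\Theta \mid \bar{\Upsilon})$ where $\bar{\Upsilon} = \{\alpha_1 + n_1, \ldots, \alpha_K + n_K\}$.
The evidence is simply:

\begin{align}
p(\mathcal{D} \mid \Upsilon) &= \frac{p(\mathcal{D} \mid \Theta)\, p_0(\Theta \mid \Upsilon)}{p(\Theta \mid \mathcal{D}, \Upsilon)} \tag{2.32} \\
&= \frac{\Gamma\left(\sum_k \alpha_k\right) \prod_k \Gamma(\alpha_k + n_k)}{\Gamma\left(N + \sum_k \alpha_k\right) \prod_k \Gamma(\alpha_k)} \tag{2.33}
\end{align}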

The hyperparameters $\alpha_k$ can be thought of as a prior/pseudo count of how many times
one saw the value $k$ in $\sum_k \alpha_k$ trials. Figure 2.5 shows several Dirichlet distributions
using different hyperparameters.
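
As a small worked illustration of Equations 2.26, 2.31 and 2.33, the sketch below computes the ML estimate, the posterior hyperparameters and the log evidence for a toy data set. The data values and function names are hypothetical choices for this example; the log-gamma function is used to avoid overflow for larger counts.

```python
from collections import Counter
from math import exp, lgamma

def counts(data, K):
    """n_k for k = 1..K: number of data points equal to k."""
    c = Counter(data)
    return [c.get(k, 0) for k in range(1, K + 1)]

def ml_estimate(data, K):
    """Equation 2.26: relative frequencies n_k / N."""
    return [nk / len(data) for nk in counts(data, K)]

def dirichlet_posterior(data, alpha):
    """Equation 2.31: updated hyperparameters alpha_k + n_k."""
    return [a + nk for a, nk in zip(alpha, counts(data, len(alpha)))]

def log_evidence(data, alpha):
    """Logarithm of Equation 2.33."""
    n = counts(data, len(alpha))
    total = sum(alpha)
    return (lgamma(total) - lgamma(len(data) + total)
            + sum(lgamma(a + nk) - lgamma(a) for a, nk in zip(alpha, n)))

data = [1, 1, 2, 3, 1]      # illustrative observations with K = 3
alpha = [1.0, 1.0, 1.0]     # uniform prior
evidence = exp(log_evidence(data, alpha))
```

For this data the uniform prior is updated to hyperparameters $(4, 2, 2)$, and the evidence equals $1/420$, which can be confirmed by multiplying the sequential predictive probabilities $\frac{1}{3} \cdot \frac{2}{4} \cdot \frac{1}{5} \cdot \frac{1}{6} \cdot \frac{3}{7}$.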


■  2.4.2  Normal  Distribution 

Perhaps the most commonly used distribution for continuous valued random vectors
$x$ whose values are in $\mathbb{R}^d$ is the Gaussian or normal distribution. Given parameters
$\Theta = \{\mu, \Lambda\}$ it takes the form:

\begin{align}
p(x \mid \Theta) &= \mathcal{N}(x; \mu, \Lambda) \tag{2.34} \\
&= \frac{1}{(2\pi)^{d/2} |\Lambda|^{1/2}} \exp\left\{-\frac{1}{2} (x - \mu)^T \Lambda^{-1} (x - \mu)\right\} \tag{2.35}
\end{align}

where $\mu \in \mathbb{R}^d$ is the mean and $\Lambda \in \mathbb{R}^{d \times d}$ is a $d \times d$ positive definite matrix representing
the covariance among the elements in $x$.

Given $N$ independent samples of $x$ in $\mathcal{D}$, the ML estimates of the Gaussian's parameters
are the sample mean and covariance:

$$\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} D_n, \qquad \hat{\Lambda} = \frac{1}{N} \sum_{n=1}^{N} (D_n - \hat{\mu})(D_n - \hat{\mu})^T \tag{2.36}$$

Figure 2.5. Example Dirichlet Distributions: Distributions for $K = 3$ are displayed on the simplex
$(\pi_1, \pi_2, 1 - \pi_1 - \pi_2)$. Dark represents high probability. (a) $\text{Dir}(\pi; 1, 1, 1)$: uniform prior.
(b) $\text{Dir}(\pi; 2, 2, 2)$: prior biased toward equal $\pi_k$s. (c) $\text{Dir}(\pi; 4, 2, 2)$: prior biased toward $k = 1$.
(d) $\text{Dir}(\pi; .5, .5, .5)$: by setting $\alpha_k < 1$ one obtains a prior that favors a sparse $\pi$.


Normal-Inverse-Wishart  Distribution 

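The conjugate prior for the normal distribution is the normal-inverse-Wishart distribution.
It factorizes as a normal prior on the mean given the covariance and an inverse-Wishart
prior on the covariance given hyperparameters $\Upsilon = \{\omega, \kappa, \Sigma, \nu\}$:

\begin{align}
p_0(\Theta \mid \Upsilon) &= p_0(\mu \mid \Lambda, \Upsilon)\, p_0(\Lambda \mid \Upsilon) \tag{2.37} \\
&= \mathcal{N}(\mu; \omega, \Lambda/\kappa)\; \mathcal{IW}(\Lambda; \Sigma, \nu) \tag{2.38}
\end{align}

Here the conditional prior on $\mu$ is a Gaussian with an expected value of $\omega \in \mathbb{R}^d$ and
covariance scaled by $\kappa \in \mathbb{R}^1$. The hyperparameter $\kappa$ can be thought of as a count of prior
"pseudo-observations." The higher $\kappa$ is, the tighter the prior on $\mu$ becomes around its
mean $\omega$. The $d$-dimensional inverse-Wishart distribution with positive definite parameter
$\Sigma \in \mathbb{R}^{d \times d}$ and $\nu \in \mathbb{R}^1$ degrees of freedom has the form:

$$\mathcal{IW}(\Lambda; \Sigma, \nu) = \frac{|\Sigma|^{\nu/2}\, |\Lambda|^{-\frac{\nu + d + 1}{2}} \exp\left\{-\frac{1}{2} \mathrm{tr}\left(\Sigma \Lambda^{-1}\right)\right\}}{2^{\nu d/2}\, \pi^{d(d-1)/4} \prod_{i=1}^{d} \Gamma\left(\frac{\nu + 1 - i}{2}\right)} \tag{2.39}$$

The expected value of this distribution is simply $\Sigma / (\nu - d - 1)$. It can be shown that,
given $N$ independent Gaussian observations, the posterior distribution has the form:

$$p(\Theta \mid \mathcal{D}, \Upsilon) = \mathcal{N}(\mu; \bar{\omega}, \Lambda/\bar{\kappa})\; \mathcal{IW}(\Lambda; \bar{\Sigma}, \bar{\nu}) \tag{2.40}$$

where the updated hyperparameters are

\begin{align}
\bar{\kappa} &= \kappa + N \tag{2.41} \\
\bar{\nu} &= \nu + N \tag{2.42} \\
\bar{\omega} &= \frac{\kappa}{\kappa + N}\, \omega + \frac{1}{\kappa + N} \sum_{n=1}^{N} D_n \tag{2.43} \\
\bar{\Sigma} &= \Sigma + \sum_{n=1}^{N} D_n D_n^T + \kappa\, \omega \omega^T - \bar{\kappa}\, \bar{\omega} \bar{\omega}^T \tag{2.44}
\end{align}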

Figure  2.6  shows  an  example  normal-inverse-Wishart  prior  and  an  associated  posterior 
given  samples  drawn  from  a  normal  distribution. 

Integrating out parameters, the evidence takes the form:

$$p(\mathcal{D} \mid \Upsilon) = \frac{\prod_{i=1}^{d} \Gamma\left((\nu + N + 1 - i)/2\right)}{\prod_{i=1}^{d} \Gamma\left((\nu + 1 - i)/2\right)} \left(\frac{\kappa}{\kappa + N}\right)^{d/2} \frac{|\Sigma|^{\nu/2}}{\pi^{Nd/2}\, |\bar{\Sigma}|^{(\nu + N)/2}} \tag{2.45}$$

This is an evaluation of a multivariate-T distribution. We will show this is a special
case of the more general matrix-T distribution in the next section.
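
The hyperparameter updates and the evidence can be sketched for the scalar case $d = 1$, where $\Sigma$ and $\Lambda$ are positive numbers and determinants reduce to the values themselves. This restriction to $d = 1$ is an assumption made here to keep the example free of linear algebra libraries; the function names are hypothetical.

```python
from math import exp, lgamma, log, pi, sqrt

def niw_posterior_1d(data, omega, kappa, sigma, nu):
    """Equations 2.41 to 2.44 specialised to scalar observations."""
    n = len(data)
    kappa_bar = kappa + n
    nu_bar = nu + n
    omega_bar = (kappa * omega + sum(data)) / kappa_bar
    sigma_bar = (sigma + sum(x * x for x in data)
                 + kappa * omega ** 2 - kappa_bar * omega_bar ** 2)
    return omega_bar, kappa_bar, sigma_bar, nu_bar

def niw_log_evidence_1d(data, omega, kappa, sigma, nu):
    """Logarithm of Equation 2.45 with d = 1."""
    n = len(data)
    _, kappa_bar, sigma_bar, nu_bar = niw_posterior_1d(data, omega, kappa, sigma, nu)
    return (lgamma(nu_bar / 2) - lgamma(nu / 2)
            + 0.5 * (log(kappa) - log(kappa_bar))
            + 0.5 * nu * log(sigma)
            - 0.5 * n * log(pi)
            - 0.5 * nu_bar * log(sigma_bar))

def sequential_log_evidence(data, omega, kappa, sigma, nu):
    """Chain rule: sum of one-step predictive log evidences."""
    total = 0.0
    for x in data:
        total += niw_log_evidence_1d([x], omega, kappa, sigma, nu)
        omega, kappa, sigma, nu = niw_posterior_1d([x], omega, kappa, sigma, nu)
    return total
```

Two sanity checks are available. For a single observation at 0 with $\omega = 0$ and $\kappa = \Sigma = \nu = 1$, the evidence is a Cauchy density with scale $\sqrt{2}$ evaluated at its mode, $1/(\pi\sqrt{2})$. In addition, processing the data one point at a time while updating the hyperparameters gives the same total as the batch formula, which is what makes the conjugate form convenient for sequential analysis.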


■  2.4.3  Matrix-Normal  Distribution 

Next consider the case when one needs to model a conditional distribution $p(y \mid x)$ where
$y \in \mathbb{R}^d$ and $x \in \mathbb{R}^m$. The normal distribution can be generalized to model a linear
Gaussian relationship such that

$$y = Ax + e, \tag{2.46}$$

where $A \in \mathbb{R}^{d \times m}$ and the random variable $e \in \mathbb{R}^d$ is drawn from a zero mean normal
distribution with covariance $\Lambda$. The conditional distribution is thus

$$p(y \mid x, \Theta) = \mathcal{N}(y; Ax, \Lambda). \tag{2.47}$$

Figure 2.6. Examples of the Normal-Inverse-Wishart distribution: (a) Shows 10 2-d normal distributions
with their parameters sampled from a normal-inverse-Wishart prior $\mathcal{N}(\mu; 0, 2\Lambda)\, \mathcal{IW}(\Lambda; I_2, 5)$.
Black dots represent the mean and the covariance is plotted as an ellipse. (b) Samples from a normal
distribution with mean $[1\ 1]^T$ and covariance $I_2$. (c) Gaussians with their parameters sampled from the
posterior on parameters given the samples in (b).


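Under this model the ML parameter estimates given $N$ observations of $y$ and $x$, $\mathcal{D}^y$
and $\mathcal{D}^x$, are

\begin{align}
\hat{A} &= \left(\sum_{n=1}^{N} D^y_n {D^x_n}^T\right) \left(\sum_{n=1}^{N} D^x_n {D^x_n}^T\right)^{-1} \tag{2.48} \\
\hat{\Lambda} &= \frac{1}{N} \sum_{n=1}^{N} \left(D^y_n - \hat{A} D^x_n\right) \left(D^y_n - \hat{A} D^x_n\right)^T \tag{2.49}
\end{align}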

Matrix-Normal-Inverse-Wishart Distribution

The conjugate prior for this model is a generalization of the normal-inverse-Wishart
distribution. Given $\Upsilon = \{\Omega, K, \Sigma, \nu\}$ a matrix-normal-inverse-Wishart distribution
has the form


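$$p_0(\Theta \mid \Upsilon) = \mathcal{MN}(A; \Omega, \Lambda, K)\; \mathcal{IW}(\Lambda; \Sigma, \nu) \tag{2.50}$$

where the matrix-normal distribution is

$$\mathcal{MN}(A; \Omega, \Lambda, K) = \frac{|K|^{d/2}}{(2\pi)^{dm/2}\, |\Lambda|^{m/2}} \exp\left\{-\frac{1}{2} \mathrm{tr}\left((A - \Omega)^T \Lambda^{-1} (A - \Omega) K\right)\right\} \tag{2.51}$$

and $\Omega \in \mathbb{R}^{d \times m}$, $K \in \mathbb{R}^{m \times m}$. If $\text{vec}(A)$ represents the columns of $A$ stacked into a
vector, then it is normally distributed as

$$p(\text{vec}(A) \mid \Omega, \Lambda, K) = \mathcal{N}\left(\text{vec}(A); \text{vec}(\Omega), K^{-1} \otimes \Lambda\right) \tag{2.52}$$

where $\otimes$ is the Kronecker product.

Given $N$ independent observations, the posterior is

$$p(\Theta \mid \mathcal{D}^y, \mathcal{D}^x, \Upsilon) = \mathcal{MN}(A; \bar{\Omega}, \Lambda, \bar{K})\; \mathcal{IW}(\Lambda; \bar{\Sigma}, \bar{\nu}) \tag{2.53}$$

where

\begin{align}
\bar{K} &= \Sigma_{x,x} & \bar{\nu} &= \nu + N \tag{2.54} \\
\bar{\Omega} &= \Sigma_{y,x} \Sigma_{x,x}^{-1} & \bar{\Sigma} &= \Sigma + \Sigma_{y|x} \tag{2.55}
\end{align}

with

\begin{align}
\Sigma_{x,x} &= \sum_{n=1}^{N} D^x_n {D^x_n}^T + K & \Sigma_{y,x} &= \sum_{n=1}^{N} D^y_n {D^x_n}^T + \Omega K \tag{2.56} \\
\Sigma_{y,y} &= \sum_{n=1}^{N} D^y_n {D^y_n}^T + \Omega K \Omega^T & \Sigma_{y|x} &= \Sigma_{y,y} - \Sigma_{y,x} \Sigma_{x,x}^{-1} \Sigma_{y,x}^T \tag{2.57}
\end{align}

Note that $\bar{K} = \Sigma_{x,x}$ reduces to $K + N$ in the special case $m = 1$, $x = 1$.

The evidence is calculated via

\begin{align}
p(\mathcal{D}^y \mid \mathcal{D}^x, \Upsilon) &= \frac{\prod_{i=1}^{d} \Gamma\left((\nu + N + 1 - i)/2\right)}{\prod_{i=1}^{d} \Gamma\left((\nu + 1 - i)/2\right)} \frac{|K|^{d/2}\, |\Sigma|^{\nu/2}}{|\Sigma_{x,x}|^{d/2}\, \pi^{Nd/2}\, |\bar{\Sigma}|^{(\nu + N)/2}} \tag{2.58} \\
&= \mathcal{MT}\left(\mathcal{D}^y;\; \Omega \mathcal{D}^x,\; \Sigma,\; I_N - {\mathcal{D}^x}^T \Sigma_{x,x}^{-1} \mathcal{D}^x,\; \nu + N\right) \tag{2.59}
\end{align}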

where $\mathcal{D}^x$ is being used as an $m \times N$ matrix, $I_k$ represents a $k \times k$ identity matrix and
$\mathcal{MT}(A; M, V, K, n)$ is a matrix-T distribution which has the form:

$$\mathcal{MT}(A; M, V, K, n) = \prod_{i=1}^{d} \frac{\Gamma\left((n + 1 - i)/2\right)}{\Gamma\left((n - m + 1 - i)/2\right)} \frac{|K|^{d/2}}{|\pi V|^{m/2}} \left|(A - M)^T V^{-1} (A - M) K + I_m\right|^{-n/2} \tag{2.60}$$

Note that one obtains a normal and corresponding normal-inverse-Wishart distribution
for the case in which $m = 1$, $x = 1$ and $A = \mu$.

■ 2.5 Select Elements of Information Theory

Information theory provides tools for quantifying uncertainty and statistical dependence.
In addition, it is closely connected to statistical inference and learning [59].
Here,  we  provide  a  quick  overview  of  the  select  elements  of  information  theory  we  will 
use  in  this  dissertation.  We  point  the  reader  to  Cover  and  Thomas  [21]  for  more  details. 

Shannon's entropy provides a measure of information and inherent uncertainty in a
random variable. For a discrete random variable with $K$ possible values and distribution
$p(x)$, entropy is defined as

$$H(x) = -\sum_{k=1}^{K} p(k) \log p(k). \tag{2.61}$$

The  entropy  is  maximized  when  p  (x)  =  1/K  is  a  uniform  distribution,  corresponding  to 
the  most  uncertainty.  For  continuous  random  variables,  an  extension  of  this  definition 
is  that  of  differential  entropy: 


$$H(x) = -\int p(x) \log p(x)\, dx \tag{2.62}$$

Joint and conditional entropy can be defined similarly:

\begin{align}
H(x, y) &= -\iint p(x, y) \log p(x, y)\, dx\, dy \tag{2.63} \\
H(y \mid x) &= -\iint p(x, y) \log p(y \mid x)\, dx\, dy \tag{2.64}
\end{align}

It  is  important  to  note  that  conditional  entropy  is  obtained  via  an  expectation  over  the 
full  joint.  It  can  alternatively  be  defined  in  terms  of  an  expectation  of  the  entropy  of 
one  variable  given  a  particular  value  of  the  other. 

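\begin{align}
H(y \mid x) &= E_{p(x)}\left[H(y \mid \mathsf{x} = x)\right] \tag{2.65} \\
&= \int p(\mathsf{x} = x)\, H(y \mid \mathsf{x} = x)\, dx \tag{2.66} \\
&= -\int p(\mathsf{x} = x) \int p(y \mid \mathsf{x} = x) \log p(y \mid \mathsf{x} = x)\, dy\, dx \tag{2.67} \\
&= -\iint p(x, y) \log p(y \mid x)\, dy\, dx \tag{2.68}
\end{align}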

With  these  definitions  it  is  straightforward  to  show  that  the  joint  uncertainty  can 
be  expressed  as: 


$$H(x, y) = H(x) + H(y \mid x) \tag{2.69}$$

and that conditioning is guaranteed not to increase uncertainty:

$$H(y) \geq H(y \mid x) \tag{2.70}$$

Using this property one obtains a simple understanding of mutual information (MI):

\begin{align}
I(x; y) &= H(y) - H(y \mid x) \tag{2.71} \\
&= H(x) - H(x \mid y) \tag{2.72}
\end{align}

That is, MI measures the decrease in uncertainty in one random variable when conditioning
on the other. For independent random variables $p(y \mid x) = p(y)$, thus $H(y \mid x) =
H(y)$, making the MI go to 0.

MI can be expressed as a Kullback-Leibler (KL) divergence. KL divergence measures
the relative entropy between two probability distributions $p(x)$ and $q(x)$:


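\begin{align}
D\left(p(x)\, \|\, q(x)\right) &= \int p(x) \log \frac{p(x)}{q(x)}\, dx \tag{2.73} \\
&= E_{p(x)}\left[\log \frac{p(x)}{q(x)}\right] \tag{2.74}
\end{align}

Similar to conditional entropy, the KL divergence between two conditional distributions
is:

$$D\left(p(y \mid x)\, \|\, q(y \mid x)\right) = \iint p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)}\, dy\, dx \tag{2.75}$$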

Consider a binary hypothesis test for choosing a distribution on $x$ in which hypotheses
$H_1$ and $H_2$ are defined as:

\begin{align}
H_1 &: x \sim p(x) \tag{2.76} \\
H_2 &: x \sim q(x) \tag{2.77}
\end{align}

Assuming equal priors, $p(H_1) = p(H_2)$, a log likelihood ratio test takes the form:

$$l_{1,2} = \log \frac{p(x)}{q(x)} \underset{H_2}{\overset{H_1}{\gtrless}} 0 \tag{2.78}$$


It is easy to see the intimate connection between this hypothesis test and KL divergence.
In expectation under $H_1$ the log likelihood ratio is simply a measure of relative entropy:
the expected value of $\log \frac{p(x)}{q(x)}$ under $p(x)$ is $D\left(p(x)\, \|\, q(x)\right)$.
Two alternative views of MI emerge. It can be thought of as the KL divergence between
the joint distribution of two random variables and the product of their marginals,

$$I(x; y) = D\left(p(x, y)\, \|\, p(x)\, p(y)\right), \tag{2.79}$$

or simply in terms of a hypothesis test on the presence of statistical dependence.
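
The relationships among entropy, conditional entropy, MI and KL divergence can be checked numerically for a small discrete joint distribution. The sketch below uses natural logarithms (nats) and a made-up $2 \times 2$ joint table; all function names are hypothetical and chosen only for this illustration.

```python
from math import log

def entropy(p):
    """Equation 2.61 for a distribution given as {value: probability}."""
    return -sum(v * log(v) for v in p.values() if v > 0)

def marginals(joint):
    """Marginals p(x) and p(y) of a joint table {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return px, py

def conditional_entropy(joint):
    """Discrete analogue of Equation 2.64: -sum p(x,y) log p(y|x)."""
    px, _ = marginals(joint)
    return -sum(v * log(v / px[x]) for (x, y), v in joint.items() if v > 0)

def kl(p, q):
    """Discrete analogue of Equation 2.73."""
    return sum(v * log(v / q[k]) for k, v in p.items() if v > 0)

def mutual_information(joint):
    """Equation 2.79: KL between the joint and the product of marginals."""
    px, py = marginals(joint)
    product = {(x, y): px[x] * py[y] for x in px for y in py}
    return kl(joint, product)

# Illustrative joint distribution in which x and y tend to agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px, py = marginals(joint)
```

For this table the KL form of MI agrees with $H(y) - H(y \mid x)$, and the chain rule $H(x,y) = H(x) + H(y \mid x)$ holds. Replacing `joint` with a product of its marginals drives the MI to zero, as expected for independent variables.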

■  2.6  Summary 

We  defined  statistical  dependence  and  discussed  how  it  can  be  encoded  in  a  graphical 
model.  Markov  models  for  time-series  were  presented  as  stochastic  processes  with  a 
fixed temporal dependence structure. Additionally, we presented a select set of parametric
families of marginal and conditional distributions, along with conjugate priors
which  allow  for  efficient  learning,  inference  and  evidence  calculations.  Lastly,  we  briefly 
overviewed  information  theory  and  showed  how  entropy  and  KL  divergence  can  be  used 
to  provide  a  measure  of  uncertainty  and  statistical  dependence. 

This  dissertation  will  build  upon  the  basic  foundation  this  chapter  has  provided. 
In  the  next  chapter  we  present  graphical  models  for  describing  the  dependence  among 
multiple time-series. We focus on understanding this dependence given a set of observations
and use connections to information theory to characterize the difficulty of
a structural inference task. In addition, we will show how conjugate priors allow for
tractable inference and characterization of uncertainty in structure.

Chapter 3

Static Dependence Models for Time-Series


In this chapter we present the underlying probabilistic models we use to describe dependent
time-series. While our ultimate aim is to perform dependence analysis in a
dynamic setting, it is important to first understand the more restricted static case.
Given observations of multiple time-series over a set of times, the goal of a static dependence
analysis task is to describe the relationships among these observed time-series
in  terms  of  a  fixed  dependence  structure.  Here,  we  examine  static  dependence  analysis 
in  detail  and  cast  the  problem  as  inference  over  structure  in  a  static  dependence  model. 
A  static  dependence  model  is  a  probabilistic  model  which  describes  the  evolution  of 
multiple  time-series  in  addition  to  the  relationships  among  them  in  terms  of  a  common 
dependence  structure  over  all  time.  It  is  important  to  note  that,  while  the  framework 
developed  here  can  be  used  to  learn  predictive  models  for  classification  and  examination 
of  time-series  observed  in  the  future,  our  primary  focus  is  on  structure  discovery  as  a 
tool  for  data  analysis. 

We begin by establishing some standard notation which will be used for the remainder
of the dissertation. A general static dependence model for time-series is presented
along with a discussion of the key challenges for inference using these models. Next,
in Section 3.3 a specific static dependence model is introduced which describes the
dependence among multiple time-series in terms of sets of independent groups. Using
this model, we focus on cases in which one is interested in identifying one active
structure from a relatively small (and therefore tractable) enumerated set of possible
dependence relationships. In the absence of any prior knowledge, inference over structure
is discussed in a maximum likelihood framework and shown to be a point estimate
approximation to Bayesian inference.

Moving beyond a tractable enumerated set of structures, in Section 3.4, we present
a more detailed directed static dependence model which explicitly encodes the causal
relationships among time-series in terms of directed structures. The number of possible
directed structures grows super-exponentially with the number of time-series being modeled.
A conjugate prior on the directed structure and parameters of this model is presented
which allows one to reason over the set of all directed structures in exponential-time
complexity. Furthermore, by imposing simple local or global structural constraints
we show that one can reduce the exponential-time complexity to polynomial-time complexity
for reasoning over a still super-exponential number of candidate structures.
Specifically, we focus on bounded in-degree structures with directed trees and forests
being special cases with global constraints. These constraints yield tractable Bayesian
inference over directed structures, allowing exact calculation of the partition function
as well as additional marginal event probabilities. The method we present for Bayesian
reasoning over structure is closely related to that of [33, 54], but extended to the
analysis of time-series for which the strict temporal ordering provides a computational
advantage.

■  3.1  Notation 

We  begin  by  introducing  some  notation  for  the  purpose  of  explicitly  denoting  individual 
time-series,  past  values  of  individual  time-series  as  well  as  sets  thereof.  This  notation 
is  summarized  in  Table  3.1  for  future  reference. 

Consider $N$ time-series and let $x_t^v$ be a random variable representing the vector
value of the $v$-th time-series at time $t$. The random variable $X_t$ is the set of the
random variables for all $N$ time-series at time $t$, i.e. $X_t = \{x_t^1, \ldots, x_t^N\}$. Subsets
$S \subset \{1, \ldots, N\}$ of time-series can be indexed using the notation $x_t^S$. For example, the
random variable representing values of time-series 1, 2 and 4 at time $t$ is $x_t^{1,2,4}$. For a
model of order $r$, the $r$ past values of the $v$-th time-series at time $t$ are represented by the
random variable $\tilde{x}_t^v$. That is, $\tilde{x}_t^v$ is the set $\{x_{t-1}^v, \ldots, x_{t-r}^v\}$. As we will be explicit about
the temporal model order, $r$ is suppressed in the notation $\tilde{x}_t^v$ for brevity. The random
variable $\tilde{X}_t = \tilde{x}_t^{1,\ldots,N} = \{\tilde{x}_t^1, \ldots, \tilde{x}_t^N\}$ is the $r$ past values for all $N$ time-series.

Multiple time points can be indexed by a vector $\mathbf{t} = [t_1, t_2, \ldots, t_T]$ such that $X_{\mathbf{t}} =
\{X_{t_1}, \ldots, X_{t_T}\}$. We will sometimes reference a contiguous set of times using the notation
$1{:}T$ rather than $\mathbf{t} = \{1, 2, \ldots, T\}$. Note that the full past $\tilde{X}_{\mathbf{t}}$ can be formed from
| Symbol | Description |
|---|---|
| $N$ | Number of time-series. |
| $r$ | Order of Markov temporal dependence. The current time is dependent on the $r$ past values. |
| $x_t^v$ | Random variable representing the vector value of the $v$-th time-series at time $t$. |
| $x_t^S = x_t^{S_1,\ldots,S_m}$ | Random variable representing a set of time-series at time $t$ where $S = \{S_1, \ldots, S_m\}$. For example, $x_t^{1,2}$ represents $x_t^1$ and $x_t^2$. We will sometimes treat $x_t^S$ as a single random vector. |
| $\tilde{x}_t^v$ | Random variable representing the $r$ past of time-series $v$ at time $t$. Specifically, $\tilde{x}_t^v$ represents the set $x_{t-1}^v$ through $x_{t-r}^v$. Three simple cases: if $r = 0$, $\tilde{x}_t^v = \emptyset$; if $r = 1$, $\tilde{x}_t^v = x_{t-1}^v$; if $r = 2$, $\tilde{x}_t^v = \{x_{t-1}^v, x_{t-2}^v\}$. |
| $X_t$ | Random variable representing all $N$ time-series at time $t$. $X_t = x_t^{1,\ldots,N}$. |
| $\tilde{X}_t$ | Random variable representing the $r$ past of all $N$ time-series at time $t$. $\tilde{X}_t = \tilde{x}_t^{1,\ldots,N}$. |
| $\mathbf{t} = \{t_1, \ldots, t_T\}$ | A set of time points. |
| $1{:}T$ | The set of numbers from 1 to $T$. In general, $a{:}b$ represents the set of numbers from $a$ to $b$. Often we use this notation in place of $\mathbf{t} = \{1, \ldots, T\}$. |
| $X_{\mathbf{t}}$ | Random variable representing all time-series at the time points specified by $\mathbf{t}$, $\{X_{t_1}, \ldots, X_{t_T}\}$. Similarly $x_{\mathbf{t}}^S$ is all time-series specified in set $S$ at the points specified in $\mathbf{t}$. |
| $\mathcal{D}$ | Observations of all time-series over all time. |
| $\mathcal{D}_t^S$ | Observation of $x_t^S$. |
| $\tilde{\mathcal{D}}$ | Observation of all past values $\tilde{X}_{\mathbf{t}}$. Note that $\tilde{\mathcal{D}}$ can be formed from $\mathcal{D}$ and additional information $C$. |
| $C$ | Set of initial conditions or extra observations needed to form $\tilde{X}_{\mathbf{t}}$ and/or $\tilde{\mathcal{D}}$. For example, if $\mathbf{t} = \{1, \ldots, T\}$ and $r = 1$, one needs $C = X_0$ to form $\tilde{X}_{\mathbf{t}}$. |

Table 3.1. Notation Summary: Notation for time-series, past values of time-series and sets thereof.
$X_{\mathbf{t}}$ and a set $C$ containing values not available in $X_{\mathbf{t}}$. For example, if $r = 1$, $\tilde{X}_{1:T}$ can
be formed from $X_{1:T}$ and initial conditions $C = \{X_0\}$. We will often treat these sets
of random variables as random vectors or matrices when convenient. For example, one
can treat $x_{1:T}^{1,2}$ as a matrix:

$$
x_{1:T}^{1,2} =
\begin{bmatrix} x_1^{1,2} \\ \vdots \\ x_T^{1,2} \end{bmatrix} =
\begin{bmatrix} x_1^1 & x_1^2 \\ \vdots & \vdots \\ x_T^1 & x_T^2 \end{bmatrix}
$$

We extend this notation to describe observed data. Let $\mathcal{D}$ be a set of $T$ complete
observations, $\{X_1, \ldots, X_T\}$. We use the notation $\mathcal{D}^S$ to denote observations of a specified
set of time-series, $\{x_1^S, \ldots, x_T^S\}$. An observation at time $t$ of all time-series is denoted
as $\mathcal{D}_t$, and $\mathcal{D}_t^v$ specifies the observation for the $v$-th time-series at time $t$. We use the
notation $\tilde{\mathcal{D}}$ for observations of the past $\{\tilde{X}_1, \ldots, \tilde{X}_T\}$. It can be formed using $\mathcal{D}$ and
past information or initial conditions $C$.
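
To illustrate how the past values can be assembled from the observations and the initial conditions, here is a minimal Python sketch. It assumes scalar-valued time-series stored as the rows of a NumPy array; the function name `build_past` and the array layout are illustrative choices and are not part of the notation above.

```python
import numpy as np

def build_past(X, C, r):
    """Collect the r past values for every time t = 1..T.

    X : array of shape (T, N); row t-1 holds X_t (scalar-valued series assumed).
    C : array of shape (r, N); initial conditions X_{1-r}, ..., X_0 in time order.
    Returns an array of shape (T, r, N) where result[t-1, k-1] holds X_{t-k}.
    """
    X = np.asarray(X, dtype=float)
    T, N = X.shape
    C = np.asarray(C, dtype=float).reshape(r, N)
    full = np.vstack([C, X])              # rows: X_{1-r}, ..., X_0, X_1, ..., X_T
    past = np.empty((T, r, N))
    for i in range(T):                    # i corresponds to time t = i + 1
        for k in range(1, r + 1):
            past[i, k - 1] = full[r + i - k]   # row index of X_{t-k}
    return past
```

With `r = 1` and `C` holding a single row for the initial value, `build_past(X, C, 1)[0, 0]` is that initial row, matching the example in which the past is formed from the observations and one initial condition.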

■  3.2  Static  Dependence  Models 

In  this  chapter,  we  focus  on  the  design  of  a  static  dependence  model  for  N  time-series. 
Given a specified structure $E$ and set of parameters $\Theta$, this model is assumed to be
$r$-th order Markov:

$$
p(X_{\mathbf{t}} \mid E, \Theta, C) = \prod_{i=1}^{T} p(X_{t_i} \mid \tilde{X}_{t_i}, E, \Theta). \tag{3.2}
$$

That is, we assume a fixed temporal structure in which $X_t$ is dependent on its $r$ past
values $\tilde{X}_t$. Here, the structure $E$ represents the dependence among the $N$ time-series
assuming this fixed temporal Markov relationship. In order to simplify notation we will
drop the $C$ when it is clear from the context and use $p(X_{\mathbf{t}} \mid E, \Theta)$.

Note that if $r = 0$ then $\tilde{X}_t = \emptyset$ and one obtains a distribution that is independent
and identically distributed over time:

$$
p(X_{\mathbf{t}} \mid E, \Theta) = \prod_{i=1}^{T} p(X_{t_i} \mid E, \Theta). \tag{3.3}
$$

Figure  3.1  summarizes  the  notation  and  general  form  of  a  static  dependence  model 
for  N  =  2  time-series  with  r  =  1.  It  shows  three  views.  The  upper  left  depicts  the 
two  time-series  in  an  abstract  graphical  model.  The  dependence  structure  E  specifies 
relationships among the time-series. In the figure $E$ is left abstract as shaded blue regions.
Collapsing the time-series into a single $X_t$ at each point in time reveals the Markov
Figure 3.1. Three Abstract Views of a Static Dependence Model: Three views of a static dependence
model for $N = 2$ time-series with accompanying notation are shown. The top left view shows an abstract
undirected graphical model for the static dependence model in which the structure $E$ would describe
the relationships between the time-series in the shaded blue region. The bottom left view exposes the
$r$-th order temporal structure by representing both time-series as a single random variable at each
$t$. The top right view collapses the time-series over all time and can be used to help understand the
dependence structure between the time-series. Again, here $E$ is left abstract.

structure  shown  in  the  temporal  view  on  the  bottom.  The  upper-right  graph  depicts 
each  time-series  collapsed  over  all  time  as  a  diamond  shaped  vertex.  The  dependence 
structure  E  specifies  the  relationship  among  these  vertices  in  this  dependence  graph. 
We  use  diamond  shaped  vertices  in  the  dependence  graph  so  that  it  is  not  directly 
interpreted  as  a  graphical  model,  but  instead  provides  a  summary  of  the  dependence 
among  time-series. 

Concrete  examples  will  be  given  in  Sections  3.3  and  3.4  where  we  present  two  dif¬ 
ferent  static  dependence  models.  The  primary  difference  between  the  two  models  is  the 
way  in  which  the  structure  is  specified  and  how  we  will  use  each  model  to  reason  about 
dependence  among  time-series.  We  will  use  the  notation  E  when  discussing  a  general 
static dependence model. This notation will change for each specific model in order
to directly reflect how its structure is specified. Specifically, in Section 3.3 we will
introduce a model which specifies structure in terms of hyperedges $F$, while in Section
3.4 we will discuss a model which uses directed edges $E$.


■  3.2.1  Structural  Inference 


Given a static dependence model our ultimate goal is that of structural inference. That
is, we are interested in inferring the structure $E$ given a set of observed data $\mathcal{D}$ while
treating $\Theta$ as nuisance parameters. In structure inference problems, as in any Bayesian
inference problem, one would ideally like to calculate the posterior on structure given
the observed data $\mathcal{D}$. Using Bayes' rule and basic properties of probability we see that
the posterior has the form:

$$
\begin{aligned}
p(E \mid \mathcal{D}) &= \frac{1}{p(\mathcal{D})}\, p_0(E)\, p(\mathcal{D} \mid E) \\
&= \frac{1}{p(\mathcal{D})}\, p_0(E) \int_{\Theta \mid E} p(\mathcal{D} \mid E, \Theta)\, p_0(\Theta \mid E)\, d\Theta,
\end{aligned} \tag{3.4}
$$

where $p_0(\cdot)$ is used to indicate a prior distribution. There are three main challenges one
needs to address before being able to calculate the posterior:


1.  One  must  define  priors  over  the  space  of  parameters  and  structures  that  can  be 
reasoned  over  efficiently. 

2.  One  must  be  able  to  tractably  compute  the  integral  over  the  unknown  parameters. 

3. In order to obtain an exact posterior probability, the data evidence, $p(\mathcal{D})$, must
be calculated. This involves integrating out both parameters $\Theta$ and structure $E$.

The first challenge is that of specifying prior knowledge about parameters and structure.
That is, Equation 3.4 uses a prior on structure, $p_0(E)$, in addition to a prior on
parameters given a specific structure, $p_0(\Theta \mid E)$. Defining these terms is difficult
in general for three reasons. First, depending on how the class of structures is specified,
the number of possible structures can be very large. In general the number of allowable
structures is super-exponential in the number of time-series. Second, one must provide
a prior on parameters for each possible structure. Each structure will generally have
a different number of parameters, each with a different role. For example, a structure
which describes independent time-series needs fewer parameters than those which encode
more dependence. Third, there is the question of what should be done when no
prior knowledge is available. Specifically, while one may be able to define uniform priors
on the discrete set of structures, placing an uninformative prior over the continuous
space of parameters can be difficult. One may have to turn to placing very broad priors
as an approximation.

We will address the first challenge in two different ways using the two specific dependence
models described in the following sections. In Section 3.3, we focus on cases
in which we consider a small set of allowable structures and, in the absence of any
prior information, adopt a frequentist view by concentrating on the likelihood term
$p(\mathcal{D} \mid E, \Theta)$ rather than the posterior. In Section 3.4, we introduce a model along with
a tractable conjugate prior on both the parameters and directed structure describing
causal relationships among time-series.

The second challenge in calculating Equation 3.4 is related to the fact that we are
primarily interested in making probabilistic statements regarding the structure and are
treating the parameters $\Theta$ as a nuisance. The integration over all parameters weighted
by their prior avoids having to estimate or choose one particular parameter. However,
this integral may be difficult to compute. We address this challenge in two different ways using
the two models presented in Sections 3.3 and 3.4. In Section 3.3.2 we discuss maximum
likelihood inference and show the sense in which it approximates this integration with
a point estimate. In Section 3.4.3, using a different model, we show how to calculate
this evidence term using a conjugate prior.

The third challenge in Equation 3.4 is that of calculating $p(\mathcal{D})$. This is of concern
if one desires an exact calculation of the posterior. Maximum a posteriori and maximum
likelihood estimates of $E$ do not require this normalization term. However, if an
exact posterior can be obtained, a wide variety of useful statistics and exact marginal
posterior probabilities can be calculated. These quantities allow a full characterization
of uncertainty in the structure. In Section 3.4.3 we present priors which are conjugate
and allow for tractable exact posterior calculations.

■  3.3  Factorization  Model 

In the previous section we discussed the general form of static dependence models for
time-series. Here, we specialize the model such that the dependence among time-series
is described in terms of independent groups. We denote these models as static factorization
models, FactMs, and will sometimes use the notation FactM($r$) to explicitly specify
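Table 3.2. Example FactMs: Simple examples showing a specific factorization $F$ and the form of the
resulting distribution with $N = 2$ and $r = 1$ or $r = 0$.

| $r$ | $F$ | $p(X_t \mid \tilde{X}_t, F, \Theta)$ |
|---|---|---|
| 1 | $\{\{1\},\{2\}\}$ | $p(x_t^1 \mid x_{t-1}^1, \Theta_1)\, p(x_t^2 \mid x_{t-1}^2, \Theta_2)$ |
| 1 | $\{\{1,2\}\}$ | $p(x_t^1, x_t^2 \mid x_{t-1}^1, x_{t-1}^2, \Theta_{1,2})$ |
| 0 | $\{\{1\},\{2\}\}$ | $p(x_t^1 \mid \Theta_1)\, p(x_t^2 \mid \Theta_2)$ |
| 0 | $\{\{1,2\}\}$ | $p(x_t^1, x_t^2 \mid \Theta_{1,2})$ |
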
the temporal order $r$. The independent groups are specified by a set of hyperedges $F$
which form a partitioning of the set $\{1, \ldots, N\}$. Specifically, $F$ is a set of $|F|$ hyperedges
where each $F_i \in F$ is restricted such that:

\begin{align}
\bigcup_{i=1}^{|F|} F_i &= \{1, 2, \ldots, N\} \tag{3.5} \\
F_i \cap F_j &= \emptyset \quad \forall\, i \neq j \in \{1, 2, \ldots, |F|\}. \tag{3.6}
\end{align}

That is, the union of all hyperedges is the full set $\{1, \ldots, N\}$ and no two hyperedges share
elements. For $N = 2$, only $F = \{\{1\},\{2\}\}$ and $F = \{\{1,2\}\}$ are consistent with this
definition.
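
The two conditions above are easy to check mechanically. The following is a small Python sketch, assuming time-series are labeled by the integers 1 through N and hyperedges are given as Python sets; the function name `is_factorization` is a hypothetical helper introduced only for illustration.

```python
def is_factorization(F, N):
    """Return True if the hyperedges in F partition {1, ..., N}.

    F : iterable of collections of integers (the hyperedges).
    Checks that no hyperedge is empty, that the hyperedges are pairwise
    disjoint, and that their union is the full index set.
    """
    blocks = [set(b) for b in F]
    if any(len(b) == 0 for b in blocks):
        return False
    union = set().union(*blocks) if blocks else set()
    # Blocks are pairwise disjoint exactly when no element is counted twice.
    disjoint = sum(len(b) for b in blocks) == len(union)
    return disjoint and union == set(range(1, N + 1))
```

For N = 2 this accepts `[{1}, {2}]` and `[{1, 2}]` and rejects everything else, in line with the statement above.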

Given $F$ and parameters $\Theta$, a FactM conditional distribution takes the form:

$$
p(X_t \mid \tilde{X}_t, F, \Theta) = \prod_{f=1}^{|F|} p\left(x_t^{F_f} \mid \tilde{x}_t^{F_f}, \Theta_{F_f}\right). \tag{3.7}
$$

That is, the conditional distribution factorizes according to $F$ with each factor having
its own set of parameters. Note the change in notation from the more general
$p(X_t \mid \tilde{X}_t, E, \Theta)$ in Equation 3.2 to this specific form $p(X_t \mid \tilde{X}_t, F, \Theta)$. That is, in order
to be explicit about how structure is specified and to be consistent with the notation in
Chapter 2 we replace $E$ with hyperedges $F$ when describing a FactM. In this context, we
will also often refer to the set of hyperedges $F$ as a factorization and $F_i \in F$ as the $i$-th
factor.

Table 3.2 presents some simple examples for $N = 2$. Perhaps the simplest model to
understand is the case in which $r = 0$. If one assumes a Gaussian form for each factor
model, each parameter $\Theta_{F_f}$ contains the mean and covariance of the elements in $x_t^{F_f}$.
Thus, the difference between the last two rows in Table 3.2 is that of a Gaussian with
Figure 3.2. Example FactM(1) Structure and Association Graph: The Markov structure of an
example FactM(1) model for $N = 3$ time-series is shown in (a) and the corresponding association graph in
the form of a factor graph is shown in (b).

a block diagonal covariance formed by the two block elements specified by $\Theta_1$ and $\Theta_2$
(3rd row), versus one with a full covariance specified by $\Theta_{1,2}$ (4th row).
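
As a worked illustration of this comparison, the two $r = 0$ Gaussian cases for $N = 2$ can be written side by side. The symbols $\mu$ and $\Sigma$ below are introduced here only for illustration, under the assumption that each $\Theta$ collects a mean and a covariance:

$$
\begin{aligned}
p(x_t^1 \mid \Theta_1)\, p(x_t^2 \mid \Theta_2) &= \mathcal{N}\!\left( \begin{bmatrix} x_t^1 \\ x_t^2 \end{bmatrix};\ \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \right), \\
p(x_t^1, x_t^2 \mid \Theta_{1,2}) &= \mathcal{N}\!\left( \begin{bmatrix} x_t^1 \\ x_t^2 \end{bmatrix};\ \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^{\top} & \Sigma_{22} \end{bmatrix} \right).
\end{aligned}
$$

The independent factorization is therefore the special case of the joint one in which the cross-covariance block $\Sigma_{12}$ is constrained to be zero.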

Figure 3.2 shows two views of a FactM(1) for $N = 3$ time-series with factorization
$F = \{\{1, 2\}, \{3\}\}$. This factorization is just one of the five possible for three time-series.
Figure 3.2(a) shows the structure of this FactM as an undirected graphical model.
Figure 3.2(b) provides an alternative view as a dependence factor graph. We will often
refer to this factor graph as the association graph to emphasize that one can think of all
time-series that belong to a common factor as being "associated". Additionally, we will
use such graphs in Chapter 5 when describing an audio-visual association task. The
association graph hides the temporal structure and uses diamond shaped vertices with
a number to represent entire time-series.

■  3.3.1  The  Set  of  Factorizations 

Definition 3.3.1 ($\mathcal{B}_N$). $\mathcal{B}_N$ is the set of all partitions of the set $\{1, \ldots, N\}$.

As just discussed, the dependence structure of a FactM is specified by an $F \in \mathcal{B}_N$.
An important question one may ask is: how many distinct dependence relationships
can one represent? That is, what is the size of the set $\mathcal{B}_N$? This is a well understood
quantity called the Bell number$^1$, $B_N = |\mathcal{B}_N|$ [78]. For $N = 1$ there is only a single
partition, $\{\{1\}\}$, and thus $B_1 = 1$. Higher Bell numbers can be obtained using the
recursion

$$
B_N = 1 + \sum_{k=1}^{N-1} \binom{N-1}{k} B_k. \tag{3.8}
$$

In words, the recursion says that to find the number of partitions of $N$ numbers you
sum up a series of terms from $k = 1$ to $k = N - 1$, each of which is the number of ways
you can partition a set of $k$ numbers, $B_k$, multiplied by the number of ways you could
have chosen those $k$ numbers from a set of size $N - 1$, $\binom{N-1}{k}$. The addition of 1
is due to the fact that there is always the partition which is the full set $\{\{1, \ldots, N\}\}$.
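
The recursion described above can be sketched in a few lines of Python. The function names `bell` and `partitions` are illustrative; the brute-force enumerator is included only as a sanity check for small N and is not a method taken from the text.

```python
from math import comb

def bell(N):
    """Bell number B_N via B_N = 1 + sum_{k=1}^{N-1} C(N-1, k) * B_k, with B_1 = 1."""
    B = {1: 1}
    for n in range(2, N + 1):
        B[n] = 1 + sum(comb(n - 1, k) * B[k] for k in range(1, n))
    return B[N]

def partitions(items):
    """Yield every partition of a list as a list of blocks (brute force)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        # Either `first` forms its own block, or it joins one existing block.
        yield [[first]] + p
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
```

For example, `bell(3)` and `len(list(partitions([1, 2, 3])))` both give 5, the number of factorizations of three time-series.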

Equation 3.8 provides an algorithm for calculating the number of factorizations for
a given $N$. Another important question is: how does this number grow with $N$? The
asymptotic form of the Bell number was derived by de Bruijn [22] and simplifies to

$$
\begin{aligned}
B_N &= e^{N(\log N - \log\log N + O(1))} \\
&= N^N e^{N(O(1) - \log\log N)}.
\end{aligned} \tag{3.9}
$$

That is, the number of possible factorizations grows super-exponentially with $N$.

Reasoning over the full set, $\mathcal{B}_N$, is generally intractable for large $N$. However, in
this dissertation, we use FactMs for tasks in which there is a small tractable subset
of $M$ factorizations one is interested in. Specifically, in Chapter 5 the factorization
structures considered are linked to a tractable number of possible sources of speech in
an audio-visual association task.

■  3.3.2  Maximum-Likelihood  Inference  of  FactM  Structure 

Given observed data $\mathcal{D}$, we wish to infer structure $F$. Here, using a FactM, we address
the challenges discussed in Section 3.2.1 by adopting a frequentist view and focus on
obtaining point estimates of structure rather than the full posterior. That is, in the
absence of any prior information, we choose a maximum likelihood approach. Maximum
likelihood (ML) seeks to find the structure which best explains the observed data. As
discussed in Section 3.2.1 inference over structure requires some way of dealing with


$^1$ In honor of Eric Temple Bell.
nuisance parameters $\Theta$. A common approach in the ML framework is to use the max
over all parameters as well. That is, for a general static dependence model with structure
$E$ we wish to find:

$$
\hat{E} = \arg\max_{E} \max_{\Theta}\, p(\mathcal{D} \mid E, \Theta). \tag{3.10}
$$

One can view this ML optimization as an approximation to finding the maximum a
posteriori (MAP) estimate of structure $E$. It is equivalent to setting a uniform prior on
structure, $p_0(E) = \beta$, approximating the evidence $p(\mathcal{D} \mid E)$ with a point estimate
$p(\mathcal{D} \mid E, \Theta^*)$ and maximizing Equation 3.4 over $E$:

\begin{align}
\arg\max_{E}\, p(E \mid \mathcal{D}) &= \arg\max_{E}\, p_0(E) \int_{\Theta \mid E} p(\mathcal{D} \mid E, \Theta)\, p_0(\Theta \mid E)\, d\Theta \tag{3.11} \\
&= \arg\max_{E}\, \beta \int_{\Theta \mid E} p(\mathcal{D} \mid E, \Theta)\, p_0(\Theta \mid E)\, d\Theta \tag{3.12} \\
&\approx \arg\max_{E}\, p(\mathcal{D} \mid E, \Theta^*) \tag{3.13} \\
&\approx \arg\max_{E} \max_{\Theta}\, p(\mathcal{D} \mid E, \Theta) \tag{3.14} \\
&= \hat{E} \tag{3.15}
\end{align}

where the approximation made in Equation 3.13 is that the integral above can be well
approximated by only considering a single point $\Theta^*$. Equation 3.14 further assumes that
the best single point is the one that maximizes $p(\mathcal{D} \mid E, \Theta)$.

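We will use this ML approach with a FactM in an audio-visual association task
in Chapter 5. Here, we examine this optimization in more detail specifically for the
FactM. Substituting a FactM into Equation 3.10 and using the monotonicity of the log
function one obtains

\begin{align}
\hat{F} &= \arg\max_{F} \max_{\Theta}\, p(\mathcal{D} \mid F, \Theta) \tag{3.16} \\
&= \arg\max_{F} \max_{\Theta}\, \frac{1}{T} \log p(\mathcal{D} \mid F, \Theta) \tag{3.17} \\
&= \arg\max_{F} \max_{\Theta}\, \frac{1}{T} \sum_{t=1}^{T} \sum_{f=1}^{|F|} \log p\left(\mathcal{D}_t^{F_f} \mid \tilde{\mathcal{D}}_t^{F_f}, \Theta_{F_f}\right) \tag{3.18} \\
&= \arg\max_{F} \sum_{f=1}^{|F|} \max_{\Theta_{F_f}} \frac{1}{T} \sum_{t=1}^{T} \log p\left(\mathcal{D}_t^{F_f} \mid \tilde{\mathcal{D}}_t^{F_f}, \Theta_{F_f}\right). \tag{3.19}
\end{align}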
By simplifying notation, we see that ML inference of structure in a FactM involves
finding the factorization $F$ which maximizes the sum over a set of weights associated
with each factor:

$$
\hat{F} = \arg\max_{F} \sum_{f=1}^{|F|} W_{F_f}, \tag{3.20}
$$

where

\begin{align}
W_{F_f} &= \max_{\Theta_{F_f}} \frac{1}{T} \sum_{t=1}^{T} \log p\left(\mathcal{D}_t^{F_f} \mid \tilde{\mathcal{D}}_t^{F_f}, \Theta_{F_f}\right) \tag{3.21} \\
&= \max_{\Theta_{F_f}} \hat{\mathbb{E}}\left[\log p\left(\mathcal{D}^{F_f} \mid \tilde{\mathcal{D}}^{F_f}, \Theta_{F_f}\right)\right] \tag{3.22} \\
&= -\min_{\Theta_{F_f}} \hat{H}\left(\mathcal{D}^{F_f} \mid \tilde{\mathcal{D}}^{F_f}; \Theta_{F_f}\right). \tag{3.23}
\end{align}

Here $\hat{\mathbb{E}}[\cdot]$ is the sample average and $\hat{H}(\cdot\,; \Theta)$ is the empirical estimate of entropy using the
parameters $\Theta$. The weight, $W_{F_f}$, is simply the average log likelihood of the time-series indexed
by factor $F_f$ maximized over all parameters for that factor. This is related to finding
the parameters which minimize the uncertainty, i.e. the conditional entropy. For the simple
case of $r = 0$ and Gaussian factors, the weight for factor $F_f$ is the average log likelihood of the data
within that factor, given the maximum likelihood estimate of the mean and covariance
of that factor.

When considering all factorizations, there are $2^N - 1$ unique possible factor sets to
calculate weights for. However, again, for our applications of interest we will restrict $F$
to be one of $M$ possible factorizations which potentially have many factors in common.
This greatly reduces our search and the number of weights needed to be calculated.
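
For the simple case of r = 0 and Gaussian factors, the weights and the resulting choice among M candidate factorizations can be sketched as follows. This is a minimal illustration under stated assumptions: scalar-valued time-series stored as columns of a NumPy array, and hypothetical function names `gaussian_weight` and `ml_factorization`. With the ML mean and covariance plugged in, the average log likelihood of a d-dimensional Gaussian reduces to -(d log 2π + log det Σ + d) / 2, which is what the first function computes. Weights are cached so that factors shared between candidates are evaluated only once.

```python
import numpy as np

def gaussian_weight(D, factor):
    """Weight of one factor for r = 0: average Gaussian log likelihood of the
    selected columns, evaluated at the ML mean and covariance."""
    cols = [v - 1 for v in sorted(factor)]        # series are numbered from 1
    X = np.asarray(D, dtype=float)[:, cols]
    d = X.shape[1]
    # bias=True gives the ML (divide by T) covariance estimate.
    cov = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + d)

def ml_factorization(D, candidates):
    """Return the candidate factorization with the largest summed weight,
    together with the scores of all candidates."""
    cache = {}

    def weight(f):
        key = frozenset(f)
        if key not in cache:                      # reuse weights of shared factors
            cache[key] = gaussian_weight(D, key)
        return cache[key]

    scores = [sum(weight(f) for f in F) for F in candidates]
    return candidates[int(np.argmax(scores))], scores
```

A call such as `ml_factorization(D, [[{1, 2}, {3}], [{1, 3}, {2}]])` compares two candidates of equal expressiveness; comparing nested candidates this way always favors the more expressive one, as discussed later in this section.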


A  Closer  Look  at  ML  Inference  of  Structure 

An  alternative  view  of  this  particular  ML  inference  task  is  as  an  M-ary  hypothesis  test 
over  M  allowable  FactMs,  each  with  a  unique  structure  but  unknown  parameters.  Here, 
we  analyze  how  two  hypothesized  FactMs  distinguish  themselves  from  each  other  by 
examining  the  form  of  the  likelihood  ratio  used  when  performing  such  a  test.  We  will 
show that one can separate out the role of structure from that of parametric differences
when deciding between two FactMs. When using an ML approach for estimating
parameters from data, we show that separability due to parametric differences is lost,
making it more difficult to distinguish between hypothesized models. This form of analysis
was originally presented by Ihler et al. [47]. Here, we specialize the analysis to our
particular problem formulation. We contrast this analysis with inference using dynamic
dependence models in Chapter 4.

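Consider a case in which there are two hypotheses of interest:

\begin{align}
H_1 &: \mathcal{D} \sim p(X_{1:T} \mid F^1, \Theta^1) \tag{3.24} \\
H_2 &: \mathcal{D} \sim p(X_{1:T} \mid F^2, \Theta^2). \tag{3.25}
\end{align}

Hypothesis $H_1$ states that a set of time-series are drawn from a FactM with structure
$F^1$ and parameters $\Theta^1$. Similarly $H_2$ has a structure $F^2$ and parameters $\Theta^2$. Assuming
$p(H_1) = p(H_2)$, a binary hypothesis test takes the form:

\begin{align}
l_{1,2} = \log \frac{p(\mathcal{D} \mid F^1, \Theta^1)}{p(\mathcal{D} \mid F^2, \Theta^2)} &\underset{H_2}{\overset{H_1}{\gtrless}} 0 \tag{3.26} \\
p(\mathcal{D} \mid F^1, \Theta^1) &\underset{H_2}{\overset{H_1}{\gtrless}} p(\mathcal{D} \mid F^2, \Theta^2). \tag{3.27}
\end{align}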

It is easy to see how this is equivalent to ML over the set of the two possible models:

$$
\hat{h} = \arg\max_{h \in \{1, 2\}}\, p(\mathcal{D} \mid F^h, \Theta^h). \tag{3.28}
$$


When analyzing how these hypotheses distinguish themselves from each other it
will be useful to define a special factorization, $F^{\cap}$, which describes the common sets
of variables which are dependent under both $H_1$ and $H_2$. It can be formed by keeping
all the unique non-empty intersection sets $F_i^1 \cap F_j^2$ for all $i \in \{1, \ldots, |F^1|\}$ and
$j \in \{1, \ldots, |F^2|\}$. For example, for $N = 4$ and

\begin{align}
F^1 &= \{\{1, 2, 3\}, \{4\}\} \tag{3.29} \\
F^2 &= \{\{1, 2, 4\}, \{3\}\} \tag{3.30}
\end{align}

the common factorization is

$$
F^{\cap} = \{\{1, 2\}, \{3\}, \{4\}\}. \tag{3.31}
$$
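
The construction of the common factorization from pairwise intersections can be sketched directly in Python. The function name `common_factorization` and the use of sorted lists for the output are illustrative choices.

```python
def common_factorization(F1, F2):
    """Unique non-empty pairwise intersections of the factors of F1 and F2.

    F1, F2 : iterables of collections of integers (two factorizations of the
    same index set). Returns a sorted list of sorted blocks.
    """
    common = {frozenset(a) & frozenset(b) for a in F1 for b in F2}
    common.discard(frozenset())          # drop empty intersections
    return sorted(sorted(block) for block in common)
```

Calling `common_factorization([{1, 2, 3}, {4}], [{1, 2, 4}, {3}])` reproduces the example above, giving the blocks {1, 2}, {3} and {4}.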
Using this common factorization $F^{\cap}$ we can define two conditional distributions using
the parameters associated with each hypothesis and the structure $F^{\cap}$: $p(X_t \mid \tilde{X}_t, F^{\cap}, \Theta^1)$
and $p(X_t \mid \tilde{X}_t, F^{\cap}, \Theta^2)$. Specifically, for $h \in \{1, 2\}$ we define:

$$
p(X_t \mid \tilde{X}_t, F^{\cap}, \Theta^h) = \prod_{f \in F^{\cap}} p\left(x_t^f \mid \tilde{x}_t^f, \Theta^h\right), \tag{3.32}
$$

where

$$
p\left(x_t^f \mid \tilde{x}_t^f, \Theta^h\right) = \frac{\int p(X_t, \tilde{X}_t \mid F^h, \Theta^h)\, dx_t^R\, d\tilde{x}_t^R}{\int p(\tilde{X}_t \mid F^h, \Theta^h)\, d\tilde{x}_t^R} \tag{3.33}
$$

and $R = \{1, \ldots, N\} \setminus f$ is the set of elements not in factor $f$. That is, $p(X_t \mid \tilde{X}_t, F^{\cap}, \Theta^h)$
factorizes according to the common structure $F^{\cap}$ with each factor distribution marginally
consistent with hypothesis $H_h$ at time $t$.

If both the parameters and structure are known the log likelihood ratio for $H_1$ and
$H_2$ given data $\mathcal{D}$ takes the form:


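\begin{align}
l_{1,2} &= \log \frac{p(\mathcal{D} \mid F^1, \Theta^1)}{p(\mathcal{D} \mid F^2, \Theta^2)} \tag{3.34} \\
&= \sum_{t=1}^{T} \log \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2)} \tag{3.35} \\
&= \sum_{t=1}^{T} \log \left[ \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^h)} \cdot \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^h)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2)} \right] \tag{3.36}
\end{align}

where the last step is simply a multiplication by one and $h$ is either 1 or 2. Given $H_1$
is true, the expectation of the likelihood ratio for a finite realization is:

\begin{align}
E_{\mathcal{D}}\left[l_{1,2} \mid H_1\right] &= \sum_{t=1}^{T} D\left( p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1) \,\|\, p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1) \right) \tag{3.37} \\
&\quad + \sum_{t=1}^{T} D\left( p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1) \,\|\, p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2) \right) \tag{3.38}
\end{align}

and similarly under $H_2$,

\begin{align}
E_{\mathcal{D}}\left[l_{1,2} \mid H_2\right] &= -\sum_{t=1}^{T} D\left( p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2) \,\|\, p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^2) \right) \tag{3.39} \\
&\quad - \sum_{t=1}^{T} D\left( p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^2) \,\|\, p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1) \right). \tag{3.40}
\end{align}
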
A full derivation is provided in Appendix B.1. Under each hypothesis the expected log
likelihood ratio can be decomposed into two sets of KL divergence terms. The first
set of terms compares the true structure to the common structure under a consistent
set of parameters. They capture the differences in statistical dependence between the
hypotheses. The second set of terms compares differences in both structure and
parameters.

As described above, when the true parameters are unknown an ML approach estimates
them from the data. This corresponds to a generalized likelihood ratio test
(GLRT) in which point estimates of parameters $\hat{\Theta}^1$ and $\hat{\Theta}^2$ are used in place of the true
unknown parameters. A GLRT takes the form

$$
\hat{l}_{1,2} = \log \frac{p(\mathcal{D} \mid F^1, \hat{\Theta}^1)}{p(\mathcal{D} \mid F^2, \hat{\Theta}^2)} \underset{H_2}{\overset{H_1}{\gtrless}} 0. \tag{3.41}
$$

Given a single data set $\mathcal{D}$ to analyze, the parameters $\hat{\Theta}^1$ and $\hat{\Theta}^2$ are both estimated
by maximizing the likelihood of the same data, under different factorizations. This data
came from a single unknown hypothesis. The estimate of the parameters for the true
hypothesis will be asymptotically accurate given enough data and a consistent ML
estimator. However, the parameter estimates for the incorrect hypothesis will not.
Given a consistent estimator (and some assumptions of ergodicity and stationarity) the
estimated distribution for the incorrect hypothesis will converge to a FactM with the
common structure $F^{\cap}$ and parameters consistent with the true hypothesis. That is, if
the data came from $H_1$,

\begin{align}
p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \hat{\Theta}^1) &\to p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1) \tag{3.42} \\
p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \hat{\Theta}^2) &\to p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1) \tag{3.43}
\end{align}

and if the data came from $H_2$,

\begin{align}
p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \hat{\Theta}^1) &\to p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^2) \tag{3.44} \\
p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \hat{\Theta}^2) &\to p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2). \tag{3.45}
\end{align}

See Appendix B.2 for a full derivation and assumed conditions. Asymptotically the
expected log likelihood ratio of the GLRT, $\hat{l}_{1,2}$, becomes:

\begin{align}
E_{\mathcal{D}}\left[\hat{l}_{1,2} \mid H_1\right] &= \sum_{t=1}^{T} D\left( p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1) \,\|\, p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1) \right) + 0 \tag{3.46} \\
E_{\mathcal{D}}\left[\hat{l}_{1,2} \mid H_2\right] &= -\sum_{t=1}^{T} D\left( p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2) \,\|\, p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^2) \right) + 0 \tag{3.47}
\end{align}

The terms comparing both parametric and structural differences go to zero and all that
is left is the structural comparison under consistent parameters estimated from the
data. In other words, the consequence of estimating parameters for the two different
models from the same data is the loss of ability to exploit parametric differences.


Nested  Hypotheses 

Another issue arises when the structures that are being reasoned over are increasingly
expressive, yielding nested hypotheses. That is, problems arise when data generated
from a FactM using structure $F^A$ can be explained equally well by another which uses
a more expressive structure $F^B$ and a specific choice of parameters. A FactM using
$F^B$ is more expressive than one using $F^A$ if all factors in $F^A$ are subsets of factors
in $F^B$. Another definition is that $F^B$ is more expressive than $F^A$ if their common
structure $F^{\cap} = F^A$. For example, all factorizations are more expressive than the fully
independent factorization.

As  a  consequence  of  using  ML  estimation,  the  more  expressive  model  can  always 
achieve  an  equal  or  higher  likelihood  compared  to  those  less  expressive.  This  will  result 
in  always  choosing  the  most  expressive  model  (or  the  best  among  a  set  of  most  expressive 
models). Consider only two factorizations, $F^1 = \{\{1,2\}\}$ and $F^2 = \{\{1\},\{2\}\}$. An ML
approach would always choose $F^1$ when estimating the parameters from a single finite
realization of data, and $\hat{l}_{1,2}$ will be greater than or equal to zero.

This  is  a  common  problem  when  reasoning  over  nested  models  using  GLRTs.  One 
standard solution is to estimate significance. That is, estimate a p-value: the probability,
when the data come from the less expressive model, that $\hat{l}_{1,2}$ is greater than or equal
to the value actually obtained. The p-value is an estimate of the probability of false alarm
if the more expressive model is chosen. It can be approximated by bootstrap sampling
new sets of data from $\mathcal{D}$ which are forced to obey the restrictions of the less expressive
model (cf. [38, 77, 42, 47, 83]). A decision can be made by choosing a threshold. If
the p-value is below a chosen significance threshold, the less expressive hypothesis is
rejected.


Illustrative  Example:  Dependent  versus  Independent  Gaussians 


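In order to help understand the above analysis, we look at a simple example in
which N = 2, r = 0 and each factor distribution is Gaussian.² In this case, the time-series
have no temporal dependence and there are only two possible factorizations.
Consider the following two hypotheses:

$$H_1 : \mathcal{D}_t \sim p(x_t \mid F^1, \Theta^1) = \mathcal{N}\left(\begin{bmatrix} x_t^1 \\ x_t^2 \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right) \tag{3.48}$$

$$H_2 : \mathcal{D}_t \sim p(x_t \mid F^2, \Theta^2) = p(x_t^1 \mid \Theta_1^2)\, p(x_t^2 \mid \Theta_2^2) = \mathcal{N}(x_t^1; \Delta, 1)\, \mathcal{N}(x_t^2; \Delta, 1) = \mathcal{N}\left(\begin{bmatrix} x_t^1 \\ x_t^2 \end{bmatrix}; \begin{bmatrix} \Delta \\ \Delta \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \tag{3.49}$$
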

Hypothesis $H_1$ corresponds to a dependent factorization with zero mean and dependence
$\rho$, while $H_2$ assumes independence with mean offset by $\Delta$. Note that the common
structure $F^{\cap} = \{\{1\},\{2\}\} = F^2$ due to the fact that $F^1$ is more expressive than $F^2$.
Furthermore,

$$p(x_t \mid F^{\cap}, \Theta^1) = \mathcal{N}\left(\begin{bmatrix} x_t^1 \\ x_t^2 \end{bmatrix}; \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) \tag{3.50}$$

$$p(x_t \mid F^{\cap}, \Theta^2) = \mathcal{N}\left(\begin{bmatrix} x_t^1 \\ x_t^2 \end{bmatrix}; \begin{bmatrix} \Delta \\ \Delta \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right) = p(x_t \mid F^2, \Theta^2) \tag{3.51}$$

as defined in Equation 3.33.

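Since we are dealing with Gaussian distributions, in this case, the expected log
likelihood ratio in Equations 3.38 and 3.40 can be computed in closed form. That is,

$$E_{\mathcal{D}}\left[l_{1,2} \mid H_1\right] = T\left(-\tfrac{1}{2}\log(1-\rho^2)\right) + T\Delta^2 \tag{3.52}$$

$$E_{\mathcal{D}}\left[l_{1,2} \mid H_2\right] = 0 - \frac{T}{2}\left(\log(1-\rho^2) + \frac{2\left(1 + \Delta^2(1-\rho)\right)}{1-\rho^2} - 2\right) \tag{3.53}$$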
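The closed forms for this example can be checked numerically. The sketch below is an illustration rather than part of the original derivation: because r = 0 and the samples are independent over time, each expectation is T times a Gaussian KL divergence, which the hypothetical helper `gaussian_kl` evaluates with the standard multivariate formula. The expressions `closed_h1` and `closed_h2` are worked out here from the hypotheses in Equations 3.48 and 3.49 (unit marginal variances, correlation `rho`, mean offset `delta`); all names and the chosen parameter values are assumptions.

```python
import numpy as np


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL divergence D(N(mu0, cov0) || N(mu1, cov1)) in nats."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    k = mu0.size
    diff = mu1 - mu0
    cov1_inv = np.linalg.inv(cov1)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - k + logdet1 - logdet0)


def expected_llr(rho, delta, T):
    """Expected log likelihood ratio l_{1,2} under H1 and H2 (known parameters)."""
    mu1, cov1 = np.zeros(2), np.array([[1.0, rho], [rho, 1.0]])  # H1: dependent
    mu2, cov2 = np.full(2, delta), np.eye(2)                     # H2: independent
    under_h1 = T * gaussian_kl(mu1, cov1, mu2, cov2)    # E[l | H1] =  T D(p1 || p2)
    under_h2 = -T * gaussian_kl(mu2, cov2, mu1, cov1)   # E[l | H2] = -T D(p2 || p1)
    return under_h1, under_h2


rho, delta, T = 0.6, 0.5, 100  # illustrative values only
h1, h2 = expected_llr(rho, delta, T)

# Closed forms derived for this two-Gaussian example.
closed_h1 = T * (-0.5 * np.log(1 - rho**2)) + T * delta**2
closed_h2 = -0.5 * T * (np.log(1 - rho**2)
                        + 2 * (1 + delta**2 * (1 - rho)) / (1 - rho**2) - 2)
```

With known parameters the expected ratio is positive under H1 and negative under H2, so both the structural and the parametric differences help separate the hypotheses.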

² This is similar to the example presented in [47]. Here, the parametric differences are in terms of
the mean rather than the marginal variance.

These expectations hold when the parameters associated with each hypothesis are
known. When the parameters are unknown, an ML approach estimates the parameters
for each factorization from the data. In this case, one would obtain an ML estimate of
the mean and covariance of the given data $\mathcal{D}$:

$$\hat{\mu} = \begin{bmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{bmatrix}, \qquad \hat{\Sigma} = \begin{bmatrix} \hat{\sigma}_1^2 & \hat{\rho}\,\hat{\sigma}_1\hat{\sigma}_2 \\ \hat{\rho}\,\hat{\sigma}_1\hat{\sigma}_2 & \hat{\sigma}_2^2 \end{bmatrix} \tag{3.54}$$

Using these ML parameter estimates in place of the true parameters we obtain

$$p(x_t \mid F^1, \hat{\Theta}^1) = \mathcal{N}\left(x_t; \hat{\mu}, \hat{\Sigma}\right) \tag{3.55}$$

$$p(x_t \mid F^2, \hat{\Theta}^2) = \mathcal{N}(x_t^1; \hat{\mu}_1, \hat{\sigma}_1^2)\, \mathcal{N}(x_t^2; \hat{\mu}_2, \hat{\sigma}_2^2) \tag{3.56}$$


to use when performing maximum likelihood inference. The mean and marginal variance
estimates will be the same for each factorization since the parameters are estimated from
the same data. The only difference is that in the dependent model we use
the estimate of correlation $\hat{\rho}$. As a result, $p(\mathcal{D} \mid F^1, \hat{\Theta}^1)$ will always be greater
than or equal to $p(\mathcal{D} \mid F^2, \hat{\Theta}^2)$. Thus, $F^1$ would always be chosen. This can also be seen
by examining the expected log likelihood ratios in the equivalent GLRT:

$$E_{\mathcal{D}}\left[\hat{l}_{1,2} \mid H_1\right] = T\left(-\tfrac{1}{2}\log(1-\rho^2)\right) + 0 \tag{3.57}$$

$$E_{\mathcal{D}}\left[\hat{l}_{1,2} \mid H_2\right] = 0 - 0 \tag{3.58}$$

All that remains are terms involving estimates of dependence. In this case, $\hat{l}_{1,2}$ is simply
T times an estimate of the mutual information, $I(x^1; x^2)$.

An analytic form for the p-value can be obtained here since we are dealing with
Gaussian distributions.³ However, in general, when performing a test of dependence for
N = 2 and r = 0, a p-value can be computed by simulating data from the independent
factorization. This can be done by simply permuting the time index of $\mathcal{D}^1$. Under
the independent hypothesis, all such permutations are equally likely.⁴ For each
permutation a different estimate of $\rho$ and $\hat{l}_{1,2}$ can be computed. The p-value can be
calculated as the fraction of permutations for which $\hat{l}_{1,2}$ meets or exceeds the value
obtained on the non-permuted data. This p-value can then be used to decide whether to
reject the independent hypothesis $H_2$.

³ The GLRT takes the form of a well-studied problem: deciding whether $\rho$ is non-zero given the
empirical estimate.

⁴ If r > 0 one can approximate samples from the independent factorization by randomly shifting one
time-series relative to the other. Alternatively, one could form an augmented sample $Y_t$ which contains
$\mathcal{D}_t$ and $\tilde{\mathcal{D}}_t$, permute within each factor of this augmented joint sample, and then extract new
sets of $\mathcal{D}_t$ and $\tilde{\mathcal{D}}_t$ to be used in calculating $\hat{l}_{1,2}$.
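The permutation procedure for N = 2 and r = 0 can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: for Gaussian factors with ML estimates plugged in, the GLRT statistic reduces to -T/2 log(1 - rho_hat^2), and the function names, the number of permutations, the simulated data, and the add-one correction in the returned p-value (a common convention that keeps the estimate strictly positive) are all choices made here.

```python
import numpy as np


def glrt_statistic(x1, x2):
    """GLRT log likelihood ratio, dependent vs. independent Gaussians (r = 0).

    With ML estimates plugged in, the statistic equals
    -T/2 * log(1 - rho_hat^2), i.e. T times a Gaussian mutual information estimate.
    """
    T = len(x1)
    rho_hat = np.corrcoef(x1, x2)[0, 1]
    return -0.5 * T * np.log(1.0 - rho_hat**2)


def permutation_p_value(x1, x2, num_permutations=999, seed=0):
    """Estimate P(statistic >= observed) under the independent hypothesis."""
    rng = np.random.default_rng(seed)
    observed = glrt_statistic(x1, x2)
    exceed = 0
    for _ in range(num_permutations):
        # Permuting the time index of one series breaks any dependence
        # between the two series while preserving each marginal.
        permuted = rng.permutation(x1)
        if glrt_statistic(permuted, x2) >= observed:
            exceed += 1
    # Add-one correction (assumption of this sketch).
    return (exceed + 1) / (num_permutations + 1)


# Illustrative data: x1 depends on x2 with correlation about 0.8.
rng = np.random.default_rng(1)
x2 = rng.normal(size=200)
x1 = 0.8 * x2 + 0.6 * rng.normal(size=200)
p_dependent = permutation_p_value(x1, x2, num_permutations=199)
```

A small `p_dependent` leads to rejecting the independent hypothesis at the chosen significance threshold.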

■  3.4  Temporal  Interaction  Model 

The  factorization  model  presented  in  the  previous  section  describes  multiple  time-series 
in  terms  of  collections  of  independent  groups.  It  leaves  the  details  of  how  the  time-series 
within a single group evolve abstract. That is, the details are problem dependent and
implicitly defined by the choice of parameterization for $p(x_t^{F_i} \mid \tilde{x}_t^{F_i}, \Theta_{F_i})$.

In this section, we introduce an alternative model. A temporal interaction model,
TIM(r), explicitly describes the details of the causal dependence among time-series
using a directed graph E. Specifically, E is a directed structure on N vertices. Each
vertex corresponds to a time-series and the directed edges in the graph represent the
causal dependence among time-series. A TIM(r) model with structure E and parameters
$\Theta$ factorizes as

$$p(x_t \mid \tilde{x}_t, E, \Theta) = \prod_{v=1}^{N} p\left(x_t^v \mid \tilde{x}_t^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)}\right), \tag{3.59}$$

where, as defined in Chapter 2, pa(v) returns the parents of vertex v given the structure
E. The v-th time-series at time t is dependent on its own past as well as the past
of the time-series in its parent set S = pa(v), $\tilde{x}_t^S$. The parameters $\Theta_{v|S}$ describe the
nature of this dependence.⁵

Figure 3.3(a) illustrates a TIM(1) as a directed Bayesian network for N = 3 time-series.
Here, E contains two edges, one from 2 to 1 and one from 2 to 3, and
thus pa(1) = pa(3) = {2} and pa(2) = ∅. Note that E only describes the edges across
time-series; the edges representing temporal dependence on past values of the same
time-series are a result of our base Markov assumption in static dependence models.
Figure 3.3(b) shows the alternative dependence graph view of this model, which hides
within-time-series dependence by representing each time-series as a single vertex. We
will often refer to this form of dependence graph as an interaction graph in that it
details the interactions among time-series. There is a one-to-one mapping between
the Bayesian network for a TIM(r) and its interaction graph. A directed edge from
u to v in the interaction graph implies a directed edge from $\tilde{x}_t^u$ to $x_t^v$ in the TIM(r).
As mentioned previously, this interaction graph is not meant to be interpreted as a
graphical model. It simply provides a representation from which the structure E can be
easily read.

⁵ By definition, in a TIM(r), each time-series at time t is always dependent on its own past. Thus, for
brevity, we use $\Theta_{v|S}$ rather than the more explicit notation $\Theta_{v|v,S}$ to represent the parameters of this
relationship. One can imagine alternative models in which this is not the case, but we leave such models
for future work.

Figure 3.3. Example TIM(1) Model Structure and Interaction Graph: The directed structure of a
TIM(1) for N = 3 time-series is shown in (a) and the corresponding interaction graph is shown in (b).

Note  that  the  TIM(l)  in  Figure  3.3(a)  has  an  interaction  graph  that  happens  to  be 
fully  connected  and  acyclic.  However,  in  general  it  need  not  be.  In  other  words,  the 
space  of  interaction  graphs  is  the  set  of  all  directed  graphs.  Cycles  in  the  interaction 
graph  do  not  result  in  cycles  in  the  Bayesian  network  for  the  TIM(r).  This  is  due  to  the 
assumption  of  temporal  causality.  The  edges  specified  in  E  specifically  describe  causal 
dependence from past values of time-series to current values. We show in Sections
3.4.2 and 3.4.3 that this causal assumption and lack of constraints on E allow for
efficient reasoning over structure.
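The claim that cycles in the interaction graph never produce cycles in the Bayesian network can be illustrated by unrolling a structure over time. The sketch below is a hypothetical helper, not code from the dissertation: `parents` maps each vertex to its parent set in E, and nodes of the unrolled network are (vertex, time) pairs. Since every edge points strictly forward in time, ordering nodes by time is a valid topological order.

```python
def unroll_tim(parents, r, T):
    """Edges of the Bayesian network obtained by unrolling a TIM(r) over T steps.

    parents: dict mapping vertex -> set of parent vertices in the interaction graph.
    Each vertex at time t depends on the previous r values of itself and its parents.
    """
    edges = []
    for t in range(T):
        for v, pa in parents.items():
            for u in set(pa) | {v}:          # own past plus the parents' past
                for lag in range(1, r + 1):
                    if t - lag >= 0:
                        edges.append(((u, t - lag), (v, t)))
    return edges


# A cyclic interaction graph: 1 -> 2 and 2 -> 1.
cyclic = {1: {2}, 2: {1}}
edges = unroll_tim(cyclic, r=1, T=4)

# Every edge goes from an earlier time to a later one, so no directed cycle exists.
is_acyclic = all(src[1] < dst[1] for src, dst in edges)
```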

It is important to understand the differences between a TIM(r) and the previously
introduced FactM(r). A TIM(r) assumes that the time-series values at time t are
conditionally independent of each other given the past values of their parents. This
assumption is not made in a FactM(r). This raises the question: why not use a FactM(r), which
has fewer restrictive assumptions? While the FactM(r) has more modeling power, it
achieves this by leaving its parameterization more abstract. The structure F in the
FactM only explicitly indicates independent groups of time-series. It leaves the details
of how current time-series within a group depend on their past to be implicitly defined
by the chosen parameterization and parameter values.

A TIM(r) explicitly describes these causal relationships via E. More can be
understood about the statistical dependence relationships from obtaining an estimate of
structure rather than looking at the particular parameterization or values of the
parameters. Take for example the TIM(1) shown in Figure 3.3(a). A FactM(1) with
structure F = {{1,2,3}} and an appropriate choice for $\Theta_{1,2,3}$ can equivalently model
data drawn from this TIM(1). However, very little information is given by the structure
F. In fact, data from any TIM(r) that uses a fully connected directed structure can
be equivalently modeled with a FactM(r) whose structure F = {{1, ..., N}} and the
appropriate choice of $\Theta_{1,\ldots,N}$. In other words, the directed structure used by a TIM(r)
imposes more constraints on the dependence relationships among time-series.

Once  these  differences  are  understood,  the  choice  of  model  is  dependent  on  which 
best  matches  the  end  application  of  interest.  We  will  use  a  FactM  for  reasoning  over 
audio-visual  associations  in  Chapter  5  and  a  TIM  for  understanding  interactions  among 
moving objects in Chapter 6. In our audio-visual association task we are simply
interested in whether or not there is any dependence between an audio and a video stream, while
in the object interaction analysis task our goal is to characterize the finer details of the
causal relationships among multiple moving objects.

■  3.4.1  Sets  of  Directed  Structures 

The structure of a TIM(r) is in the form of a directed graph on N vertices, with edges
specified by E. In this section, we explore how the size of the set of possible structures
grows as a function of N. We begin by characterizing the set of all possible directed
structures.

Definition 3.4.1 ($\mathcal{A}_N$): $\mathcal{A}_N$ is the set of all directed graphs on N vertices. These graphs
are allowed to have cycles but no self-loops (v → v).

For $E \in \mathcal{A}_N$ there are $2^{N-1}$ possible parent sets for each vertex in the graph. This
yields a super-exponential number of possible directed structures:

$$|\mathcal{A}_N| = \left(2^{N-1}\right)^N = 2^{N^2 - N}. \tag{3.60}$$

It is simple to see from inspection that the set of all structures for a TIM(r) is larger
than that of a FactM(r), $|\mathcal{A}_N| > |\mathcal{B}_N|$. Again, this is why a TIM can explicitly specify
more detailed dependence relationships in its structure. In the following subsections
we will discuss increasingly restrictive subsets of $\mathcal{A}_N$.


Bounded  Parent  Set 

It  may  be  desirable  to  limit  the  types  of  directed  structures  considered  for  a  TIM(r). 
Constraining  the  set  of  structures  can  allow  for  more  efficient  modeling  and  inference. 
One  simple  constraint  is  to  limit  the  number  of  parents  any  vertex  has  in  the  graph 
specified  by  E.  That  is,  one  can  limit  the  in-degree  of  each  vertex. 


Definition 3.4.2 ($\mathcal{P}_N^K$): $\mathcal{P}_N^K \subset \mathcal{A}_N$ is the set of all directed structures on N vertices in
which each vertex has no more than K parents.

A TIM(r) with structure $E \in \mathcal{P}_N^K$ constrains the maximum number of “influences”
for each time-series. Given N vertices, each has

$$\sum_{k=0}^{K} \binom{N-1}{k} \le N^K \tag{3.61}$$

possible parent sets of size less than or equal to K. While this reduces the number of
possible structures, there is still a super-exponential number of them. That is, $|\mathcal{P}_N^K|$ is
$O(N^{NK})$.


Directed  Trees  and  Forests 

The set $\mathcal{P}_N^K$ imposes a local constraint on parent set size. However, there may be
situations in which a global structure constraint is desirable. For example, one may
want to only consider directed structures which are acyclic and connected.

Definition 3.4.3 ($\mathcal{T}_N$): $\mathcal{T}_N \subset \mathcal{P}_N^1$ is the set of all directed tree structures on N vertices.
A directed tree (rooted spanning arborescence) is a fully connected structure in which
N − 1 vertices have a single parent while the remaining vertex has no parents and is
designated as the root.

If a TIM(r) has a structure $E \in \mathcal{T}_N$, there exists a root time-series which, due to
connectivity, has influence on all others given enough time. In addition, a time-series can
never be influenced by one of its children or by any time-series influenced by its children.

There are $|\mathcal{T}_N| = N^{N-1}$ directed trees on N vertices. A simple proof comes from
first considering the space of all undirected trees. Cayley’s formula states that for any
positive integer N, the number of undirected trees on N labeled vertices is $N^{N-2}$ [16]. An
undirected tree can always be converted to a directed tree by picking a root and then
directing all edges away from the root in succession. Since there are N ways to choose
the root, there are $N(N^{N-2}) = N^{N-1}$ directed trees.

Definition 3.4.4 ($\mathcal{F}_N$): $\mathcal{F}_N \subset \mathcal{P}_N^1$ is the set of all directed forest structures on N
vertices. A directed forest is an acyclic graph with each vertex having at most one parent.
A directed forest can have multiple roots.

The set $\mathcal{F}_N$ removes the fully connected assumption used to define $\mathcal{T}_N$. There are
$(N+1)^{N-1}$ directed forests on N vertices. A simple proof, again, uses a mapping from
undirected trees. Given an undirected tree on N + 1 vertices, one can form a directed
forest on N vertices by choosing a special super-root vertex, creating a directed tree
outward from this super-root and then removing that vertex and all its outgoing edges.
The roots of each tree in the directed forest are the children of the super-root vertex
in the undirected tree. Thus, there are $|\mathcal{F}_N| = (N+1)^{(N+1)-2} = (N+1)^{N-1}$ possible
directed forests.

Note that $\mathcal{T}_N \subset \mathcal{F}_N \subset \mathcal{P}_N^1$ and all grow super-exponentially with N. Figure 3.4 plots
the number of structures for the various sets discussed in this section as a function of N.
We will use TIMs in scenarios in which N may be large and the allowable dependence
relationships are not specified by the problem of interest. That is, we wish to reason
over these large sets of directed structures. Specifically, in Chapter 6 we will reason
over as many as $11^{10}$ structures of interactions among moving objects.⁶
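The counting formulas of this section can be checked against brute-force enumeration for small N. The sketch below is illustrative only; the function names are hypothetical and vertices are zero-indexed. A single-parent structure is encoded as a tuple in which `assignment[v] == v` marks v as a root and any other value names the parent of v.

```python
from itertools import product
from math import comb


def count_all(N):
    """|A_N|: all directed graphs without self-loops."""
    return 2 ** (N * N - N)


def count_bounded(N, K):
    """|P_N^K|: every vertex picks one of its parent sets of size <= K."""
    return sum(comb(N - 1, k) for k in range(K + 1)) ** N


def count_trees(N):
    """|T_N| via Cayley's formula times N choices of root."""
    return N ** (N - 1)


def count_forests(N):
    """|F_N| via undirected trees on N + 1 vertices."""
    return (N + 1) ** (N - 1)


def reaches_root(assignment, v):
    """Follow parent pointers from v; False if a cycle is encountered."""
    seen = set()
    while assignment[v] != v:
        if v in seen:
            return False
        seen.add(v)
        v = assignment[v]
    return True


def brute_force_counts(N):
    """Enumerate all single-parent assignments and count forests and trees."""
    forests = trees = 0
    for assignment in product(range(N), repeat=N):
        if not all(reaches_root(assignment, v) for v in range(N)):
            continue  # contains a directed cycle
        forests += 1
        if sum(assignment[v] == v for v in range(N)) == 1:
            trees += 1  # exactly one root, hence connected
    return forests, trees
```

For example, `brute_force_counts(4)` enumerates the $4^4$ members of the single-parent set and can be compared with `count_forests(4)` and `count_trees(4)`.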

■  3.4.2  Prior  on  TIM  Parameters  and  Structure 

We will focus on performing exact Bayesian inference as described in Section 3.2.1 when
using a TIM. In this section, we present prior models for the parameters and structures,
allowing one to specify prior knowledge about these unobserved variables. We adopt a
prior on the structure and parameters similar to those presented in [62, 33], using the
factorization

$$p_0(E, \Theta) = p_0(E)\, p_0(\Theta \mid E). \tag{3.62}$$

⁶ We will consider directed tree structures when describing the relationships among 10 players and a
ball in a basketball game.

Figure 3.4. The Size of Sets of Directed Structures vs. N: The number of possible structures as a
function of N for the set of all directed structures $\mathcal{A}_N$, directed structures with at most 2, $\mathcal{P}_N^2$, and 1,
$\mathcal{P}_N^1$, parents, directed forests $\mathcal{F}_N$, directed trees $\mathcal{T}_N$, and factorizations $\mathcal{B}_N$.

Again, there are various challenges in designing these priors. The first challenge is
due to the fact that the number of parameters specified by $\Theta$ (and potentially their
values) is a function of the specific structure. For example, in a TIM(r) the number of
parameters is a function of the number of edges in E. A second challenge was alluded to
in Section 3.4.1. That is, for TIMs there is a super-exponential number of structures
to consider. The question remains of how to place a tractable prior over this large
space of structures.

We begin by discussing the prior on parameters given a specified structure, $p_0(\Theta \mid E)$.
We adopt a form for this prior based on two key properties:

1. The distribution factorizes according to the structure, and parameters are assumed
to be independent of each other.
2. The parameters are modular. That is, the prior on parameters associated with the
v-th time-series given its parent set pa(v) is not a function of the global structure
describing the parent sets of all other time-series.

Given a structure E and hyperparameters $\Upsilon$, we assume the prior on parameters
factorizes according to the edge structure such that

$$p_0(\Theta \mid E) = \prod_{v=1}^{N} p_0\left(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon\right). \tag{3.63}$$

Here, $p_0(\Theta_{v|S} \mid \Upsilon)$ is modular. That is, it is the same for all structures E for which S is
the parent set of v. Thus, for each time-series one needs to specify a parameter prior
for all potential parent sets. Given this finite set of priors for each v, one is able to
construct a full prior on parameters for any given structure E using Equation 3.63.

As discussed in Section 3.4.1, there is an exponential number, $2^{N-1}$, of possible
parent sets when $E \in \mathcal{A}_N$. However, by bounding the number of parents to K such
that $E \in \mathcal{P}_N^K$, one only needs to specify a polynomial number of prior terms.

Next  we  focus  on  specifying  the  structural  prior,  po  (E) .  One  desirable  property 
we  wish  to  obtain  is  the  ability  to  favor  certain  structures  over  others.  Another  is  that 
the  prior  lends  itself  to  tractable  inference.  Tractability  will  be  linked  to  computation 
of  the  partition  function.  We  adopt  a  structural  prior  which  allows  one  to  favor  cer¬ 
tain  parent  sets  over  others.  While  this  prior  is  over  a  set  of  structures  which  grows 
super-exponentially  with  N,  we  show  that  it  can  be  reasoned  over  in  exponential-time. 
Furthermore, we show that by bounding the size of the parent sets in E one can
calculate the partition function in polynomial time.

The structural prior we adopt has the form

$$p_0(E) = \frac{1}{Z(\beta)} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v}, \tag{3.64}$$

where $Z(\beta)$ is the partition function. Each scalar hyperparameter $\beta_{S,v}$ can be
interpreted as a weight on the parent set S for the v-th time-series. The prior is simply
proportional to the product of these weights. Note that if all $\beta_{S,v}$ are set to 1, one
obtains a uniform prior and $Z(\beta)$ is the number of possible structures. If all $\beta_{S,v}$ are
set proportional to the size of S, |S|, the prior will favor dense structures. Conversely,
sparse structures are favored by making $\beta_{S,v}$ inversely proportional to |S|.

As noted in Section 3.4.1, there are $2^{N^2-N}$ possible structures $E \in \mathcal{A}_N$. This suggests
that one may need to explicitly sum over a super-exponential number of terms when
calculating $Z(\beta)$. Fortunately, however, when $E \in \mathcal{A}_N$, all combinations of parent sets
are allowable and there are no global structural constraints limiting the parent set of
one vertex given another. Thus, one can calculate the partition function,

$$\begin{aligned}
Z(\beta) &= \sum_{E \in \mathcal{A}_N} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v} \\
&= \sum_{S_1} \cdots \sum_{S_N} \prod_{v=1}^{N} \beta_{S_v,v} \\
&= \prod_{v=1}^{N} \sum_{S_v} \beta_{S_v,v} \\
&= \prod_{v=1}^{N} \gamma_v(\beta),
\end{aligned} \tag{3.65}$$

as a product of N summations, where a sum over $S_v$ is simply the sum over all $2^{N-1}$
allowable parent sets for v. Each $\gamma_v(\beta)$ is defined to be this summation for a particular
v. The switch of the product and sums in the above equation is possible due to the fact
that the parent set for one vertex is independent of all others. This allows $Z(\beta)$ to be
calculated as a product of N summations, each of which contains $2^{N-1}$ terms. Thus, one can
reason over all structures in exponential time, $N 2^{N-1}$, rather than super-exponential time.
While this is a large improvement, the exponential number of terms quickly becomes
intractable for large N. In the next sections we show how, using the simple constraints
discussed in Section 3.4.1, one can obtain $Z(\beta)$ in polynomial time while still considering
a super-exponential number of candidate structures.
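The exchange of sum and product can be sketched in code and checked against an explicit sum over every structure for small N. This is an illustration under assumptions: `beta` is any hypothetical callable returning the weight of parent set `S` for vertex `v`, vertices are zero-indexed, and `sparse_weight` is one example of a weight that decreases with |S|.

```python
from itertools import combinations, product
import math


def parent_sets(v, N):
    """All 2^(N-1) candidate parent sets of vertex v (no self-loops)."""
    others = [u for u in range(N) if u != v]
    return [frozenset(c) for k in range(N) for c in combinations(others, k)]


def partition_factored(beta, N):
    """Z(beta) as a product over vertices of per-vertex sums gamma_v(beta)."""
    Z = 1.0
    for v in range(N):
        Z *= sum(beta(S, v) for S in parent_sets(v, N))
    return Z


def partition_brute_force(beta, N):
    """Z(beta) as an explicit sum over all 2^(N^2 - N) structures."""
    per_vertex = [parent_sets(v, N) for v in range(N)]
    Z = 0.0
    for structure in product(*per_vertex):
        Z += math.prod(beta(S, v) for v, S in enumerate(structure))
    return Z


def sparse_weight(S, v):
    """Example weight favoring small parent sets (an assumption of this sketch)."""
    return 1.0 / (1 + len(S))
```

The factored form evaluates N sums of $2^{N-1}$ terms each, whereas the brute-force form visits every structure; both return the same value.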


Bounded  Parent  Sets 

Bounding the size of the parent set for each time-series, $E \in \mathcal{P}_N^K$, yields a partition
function that has the same form as Equation 3.65 with

$$\gamma_v(\beta) = \sum_{S_v \text{ s.t. } |S_v| \le K} \beta_{S_v,v}. \tag{3.66}$$

This is again due to the fact that the constraint imposed by $\mathcal{P}_N^K$ is local and the parent set
of each vertex (time-series) can be treated independently of all others. Each $\gamma_v(\beta)$
becomes a sum over all possible parent sets of size less than or equal to K. One can
bound the number of terms in the summation by $\sum_{k=0}^{K} \binom{N-1}{k} \le N^K$. Thus, only polynomial
computation time, $O(N^{K+1})$, is needed to calculate $Z(\beta)$ even though the total number
of structures is still super-exponential, $O(N^{NK})$.

Directed  Trees  and  Forests 

If $E \in \mathcal{T}_N$, it is restricted by both local and global constraints. The local constraint
ensures each vertex has at most one parent, while the global constraints restrict the
structure to be acyclic and fully connected. Due to the local single-parent constraint,
in this context $\beta$ can be thought of as an N × N matrix, with each hyperparameter
$\beta_{u,v}$, $u \ne v$, being interpreted as a weight on the edge u → v, and $\beta_{\emptyset,v} = \beta_{v,v}$ as a weight on
vertex v being a root. The edge set corresponding to the nonzero entries of $\beta$ forms a
support graph. We will assume this support graph is connected and contains at least
one directed tree.

While there are $N^{N-1}$ possible directed trees on N vertices, the Matrix Tree
Theorem allows one to calculate $Z(\beta)$ in polynomial time. This theorem was used by Meila
and Jaakkola [62] for reasoning over undirected trees. The undirected version of the
theorem is a special case of the often rediscovered, real-valued, directed version, originally
developed by Kirchhoff [51] in 1847. The theorem allows one to calculate the weighted
sum over all directed trees rooted at r, $Z_r(\beta)$, via

$$Z_r(\beta) = \sum_{E \text{ rooted at } r} \; \prod_{(u \to v) \in E} \beta_{u,v} = \mathrm{Cof}_{r,r}\left(Q(\beta)\right), \tag{3.67}$$

where $Q(\beta)$ is the Kirchhoff matrix with its u,v entry defined as

$$Q_{u,v}(\beta) = \begin{cases} \sum_{u' \ne v} \beta_{u',v} & u = v \\ -\beta_{u,v} & u \ne v \end{cases} \tag{3.68}$$

and $\mathrm{Cof}_{i,j}(M)$ is the i,j cofactor of matrix M. $\mathrm{Cof}_{i,r}(Q(\beta))$ is invariant to i and gives
the sum over all weighted trees rooted at r. A proof can be found in [90].

By summing over all N possible roots one obtains

$$Z(\beta) = \sum_{r=1}^{N} \beta_{r,r}\, Z_r(\beta). \tag{3.69}$$


Thus, a straightforward implementation yields $O(N^4)$ time for calculating the partition
function: $O(N^3)$ for each of the N terms $Z_r(\beta)$. However, as pointed out by Koo et al.
[56], a useful observation allows for only $O(N^3)$ time computation of $Z(\beta)$. That is, it
can be calculated via a single determinant,

$$Z(\beta) = |\hat{Q}(\beta)|, \tag{3.70}$$

where

$$\hat{Q}_{u,v}(\beta) = \begin{cases} \beta_{v,v} & u = 1 \\ Q_{u,v}(\beta) & u > 1 \end{cases} \tag{3.71}$$

The proof follows from the construction of $\hat{Q}_{u,v}$ and the invariance of $\mathrm{Cof}_{i,r}(Q(\beta))$ to i:

$$|\hat{Q}(\beta)| = \sum_{v=1}^{N} \hat{Q}_{1,v}(\beta)\, \mathrm{Cof}_{1,v}\left(\hat{Q}(\beta)\right) \tag{3.72}$$

$$= \sum_{v=1}^{N} \beta_{v,v}\, \mathrm{Cof}_{1,v}\left(Q(\beta)\right) \tag{3.73}$$

$$= \sum_{v=1}^{N} \beta_{v,v}\, Z_v(\beta) \tag{3.74}$$

$$= Z(\beta) \tag{3.75}$$


Consider a simple case in which N = 2 and all $\beta$ are set to 1. In this case $Z(\beta)$
is a count of the number of trees on two vertices. The matrix
$Q(\beta) = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$,
$Z_1(\beta) = Z_2(\beta) = 1$ and thus $Z(\beta) = 2$, which is the number of possible directed trees
on two vertices.

Directed forests, $E \in \mathcal{F}_N$, remove the fully connected assumption of $\mathcal{T}_N$ and can
have multiple roots. While this yields a larger set of structures, $Z(\beta)$ can still be
calculated in $O(N^3)$ time, as shown in [56]. Specifically,

$$Z(\beta) = \left|Q(\beta) + \mathrm{diag}(\beta_{1,1}, \ldots, \beta_{N,N})\right|, \tag{3.76}$$

where $\mathrm{diag}(a_1, \ldots, a_N)$ is a diagonal matrix with the vector a on its diagonal. This is a
consequence of the fact that any directed forest can be turned into a directed tree by
the addition of one virtual super-root vertex which has no parents and connects to all
the roots of the trees within the forest. Each $\beta_{v,v}$ can be interpreted as a weight on an
outgoing edge from this super-root vertex to vertex v.

Again, consider the case when N = 2 and all $\beta$ are set to 1. Here,
$Z(\beta) = \left|\begin{matrix} 2 & -1 \\ -1 & 2 \end{matrix}\right| = 3$,
corresponding to the fact that there are three possible forests.

Note that while $\mathcal{T}_N \subset \mathcal{F}_N \subset \mathcal{P}_N^1$, both directed trees and forests require $O(N^3)$
computation even though the larger class $\mathcal{P}_N^1$ only requires $O(N^2)$. This is due to
the imposed global acyclic constraint, which limits the parent set of one vertex based
on the parent sets of others.
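The determinant formulas for trees and forests can be sketched with NumPy and compared against brute-force enumeration for small N. This is a minimal illustration, not the dissertation's code. It assumes the in-degree form of the Kirchhoff matrix (diagonal entry v is the sum of the incoming edge weights $\beta_{u,v}$, $u \ne v$; off-diagonal entries are $-\beta_{u,v}$), which is the form consistent with the N = 2 examples above. Here `beta` is an N × N float array whose diagonal holds the root weights, vertices are zero-indexed, and all function names are hypothetical.

```python
from itertools import product
import numpy as np


def kirchhoff(beta):
    """Q[v, v] = sum over u != v of beta[u, v]; Q[u, v] = -beta[u, v] for u != v."""
    off = beta - np.diag(np.diag(beta))      # zero out the root weights
    return np.diag(off.sum(axis=0)) - off


def z_trees(beta):
    """Weighted sum over directed trees via a single determinant."""
    q_hat = kirchhoff(beta)
    q_hat[0, :] = np.diag(beta)              # first row replaced by root weights
    return np.linalg.det(q_hat)


def z_forests(beta):
    """Weighted sum over directed forests via a single determinant."""
    return np.linalg.det(kirchhoff(beta) + np.diag(np.diag(beta)))


def reaches_root(parent, v):
    """Follow parent pointers from v; False if a cycle is encountered."""
    seen = set()
    while parent[v] != v:
        if v in seen:
            return False
        seen.add(v)
        v = parent[v]
    return True


def brute_force(beta):
    """Explicit weighted sums over all trees and forests (small N only)."""
    N = beta.shape[0]
    z_tree = z_forest = 0.0
    for parent in product(range(N), repeat=N):   # parent[v] == v marks a root
        if not all(reaches_root(parent, v) for v in range(N)):
            continue
        weight = np.prod([beta[parent[v], v] for v in range(N)])
        z_forest += weight
        if sum(parent[v] == v for v in range(N)) == 1:
            z_tree += weight
    return z_tree, z_forest
```

The determinants cost $O(N^3)$, while the enumeration grows as $N^N$ and is only useful as a check.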

■  3.4.3  Bayesian  Inference  of  TIM  Structure 

Now  that  we  have  described  the  generative  model  used  by  a  TIM  and  placed  tractable 
priors  on  the  parameters  and  structures,  we  turn  to  the  task  of  Bayesian  inference  over 
these structures. Specifically, we wish to calculate (as presented in Equation 3.4):

$$p(E \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\, p_0(E)\, p(\mathcal{D} \mid E)$$

using the priors presented in the previous section. We show that both the prior
over structure and the prior on parameters given structure are conjugate, allowing
for tractable inference. In addition, given the full posterior over structure we show how
to calculate substructure appearance posteriors and expectations. Lastly, we discuss
MAP estimates of structure.

Conjugacy 

We begin by examining the joint posterior over structure E and parameters $\Theta$ given
data $\mathcal{D}$. Consider the following factorization of the posterior:

$$p(E, \Theta \mid \mathcal{D}) = p(\Theta \mid E, \mathcal{D})\, p(E \mid \mathcal{D}). \tag{3.77}$$

Below we examine each posterior term and show that the priors presented in Section
3.4.2 are conjugate. We begin with the posterior over parameters given structure,
$p(\Theta \mid E, \mathcal{D})$.

Proposition 3.4.1: If one chooses priors $p_0(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon)$ that are conjugate for their
corresponding conditional distributions $p(x_t^v \mid \tilde{x}_t^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)})$, then the full prior
$p_0(\Theta \mid E)$ presented in Equation 3.63 is conjugate for a TIM(r).


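Proof. Using the form of the prior on parameters given structure specified in Equation 3.63,
the posterior is:

$$p(\Theta \mid E, \mathcal{D}) = \frac{1}{p(\mathcal{D} \mid E)}\, p(\mathcal{D} \mid E, \Theta)\, p_0(\Theta \mid E) \tag{3.78}$$

$$= \frac{1}{p(\mathcal{D} \mid E)} \prod_{t=1}^{T} p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, E, \Theta) \prod_{v=1}^{N} p_0\left(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon\right) \tag{3.79}$$

$$= \frac{1}{p(\mathcal{D} \mid E)} \prod_{t=1}^{T} \prod_{v=1}^{N} p\left(\mathcal{D}_t^v \mid \tilde{\mathcal{D}}_t^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)}\right) \prod_{v=1}^{N} p_0\left(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon\right) \tag{3.80}$$

$$= \prod_{v=1}^{N} p\left(\Theta_{v|\mathrm{pa}(v)} \mid \mathcal{D}^{v,\mathrm{pa}(v)}, \Upsilon\right). \tag{3.81}$$

It takes the same form as the prior. That is, it is independent over each time-series
and modular. Each term is the posterior on parameters for the v-th time-series given
its parents. Choosing a $p_0(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon)$ which is conjugate results in a fully conjugate
prior on parameters given structure. Thus, the posterior can be obtained by a simple
update to the hyperparameters $\Upsilon$ for each term in the prior using the data $\mathcal{D}$. □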

Next,  we  turn  to  the  posterior  on  structure  p  {E\D^,  which  is  our  primary  quantity 
of  interest. 

Proposition  3.4.2  :  The  prior  p^  (E)  presented  in  Equation  3.64  conjugate  for  a 


TIM(r). 


Proof. We follow a similar derivation to that used in [62] for obtaining the posterior on
undirected trees.$^{7}$ Using the prior on structure given in Equation 3.64, we start with
the following form:

\begin{align}
p(E \mid \mathcal{D}) &= \frac{1}{p(\mathcal{D})}\, p(\mathcal{D} \mid E)\, p_0(E) \tag{3.82} \\
&= \frac{1}{p(\mathcal{D})}\, p(\mathcal{D} \mid E)\, \frac{1}{Z(\beta)} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v}. \tag{3.83}
\end{align}

$^{7}$ That is, while here we are deriving the posterior for a different and much broader class of structures,
the steps in the derivation are similar to [62].
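Next, we examine the data evidence given structure by integrating out parameters:

\begin{align}
p(\mathcal{D} \mid E) &= \int p(\mathcal{D} \mid \Theta, E)\, p_0(\Theta \mid E)\, d\Theta \tag{3.84} \\
&= \int \prod_{t=1}^{T} \prod_{v=1}^{N} p(x_t^v \mid \tilde{x}_t^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)}) \prod_{v=1}^{N} p_0(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon)\, d\Theta \tag{3.85} \\
&= \prod_{v=1}^{N} \int p(\mathcal{D}^v \mid \tilde{\mathcal{D}}^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)})\, p_0(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon)\, d\Theta_{v|\mathrm{pa}(v)} \tag{3.86} \\
&= \prod_{v=1}^{N} p(\mathcal{D}^v \mid \tilde{\mathcal{D}}^{v,\mathrm{pa}(v)}, \Upsilon) \tag{3.87} \\
&= \prod_{v=1}^{N} W_{\mathrm{pa}(v),v}. \tag{3.88}
\end{align}
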

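The data evidence given structure is a product of terms, each of which is the evidence
of the $v$-th time-series given its parents specified by $E$. That is, for parent set $S$ the
evidence for the $v$-th time-series is:

\begin{align}
W_{S,v} &= p(\mathcal{D}^v \mid \tilde{\mathcal{D}}^{v,S}, \Upsilon) \\
&= \int p(\mathcal{D}^v \mid \tilde{\mathcal{D}}^{v,S}, \Theta_{v|S})\, p_0(\Theta_{v|S} \mid \Upsilon)\, d\Theta_{v|S}. \tag{3.89}
\end{align}
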


As mentioned in Chapter 2, for continuous observations and a linear Gaussian model
with parameters $\Theta_{v|S}$ one can choose $p_0(\Theta_{v|S} \mid \Upsilon)$ to be a matrix-normal-inverse-Wishart
distribution with hyperparameters $\Upsilon$. This will yield efficient updates for Equation 3.81,
and $W_{S,v}$ will be the evaluation of a matrix-T distribution. For discrete observations,
one can use Dirichlet priors and have analytic forms for the evidence. Specific examples
will be given in Chapter 6.
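For the discrete case, the evidence term can be sketched in a few lines. The following is only an illustration under stated assumptions, not the parameterization used in Chapter 6: it assumes first-order dependence ($r = 1$), a symmetric Dirichlet prior with a single concentration `alpha` standing in for the hyperparameters, and integer-coded symbols. The function name `log_evidence_discrete` and the argument `cond_set` (the series whose previous values are conditioned on, which may include `v` itself) are hypothetical.

```python
from collections import defaultdict
from math import lgamma


def log_evidence_discrete(series, v, cond_set, n_symbols, alpha=1.0):
    """Log Dirichlet-multinomial evidence of series v given lagged values.

    series    : list of equal-length lists of integer symbols in [0, n_symbols)
    cond_set  : indices of the series whose value at time t-1 is conditioned on
    alpha     : symmetric Dirichlet concentration (an assumed hyperparameter)
    """
    # Count how often each symbol of series v follows each lagged configuration.
    counts = defaultdict(lambda: [0] * n_symbols)
    for t in range(1, len(series[v])):
        config = tuple(series[u][t - 1] for u in cond_set)
        counts[config][series[v][t]] += 1

    # Each configuration has its own multinomial with its own Dirichlet prior,
    # so the evidence is a product of Dirichlet-multinomial terms.
    log_w = 0.0
    for n in counts.values():
        total = sum(n)
        log_w += lgamma(n_symbols * alpha) - lgamma(n_symbols * alpha + total)
        log_w += sum(lgamma(alpha + c) - lgamma(alpha) for c in n)
    return log_w
```

Because the parameters are integrated out, a larger conditioning set is only preferred when the lagged values actually help predict series `v`.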

Substituting Equation 3.88 into Equation 3.83 one obtains

\[ p(E \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})} \frac{1}{Z(\beta)} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v}\, W_{\mathrm{pa}(v),v}. \tag{3.90} \]
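Using the fact that the distribution must sum to 1 we solve for the data evidence $p(\mathcal{D})$:

\begin{align}
1 &= \sum_{E} p(E \mid \mathcal{D}) \tag{3.91} \\
&= \frac{1}{p(\mathcal{D})} \frac{1}{Z(\beta)} \sum_{E} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v}\, W_{\mathrm{pa}(v),v} \tag{3.92} \\
&= \frac{1}{p(\mathcal{D})} \frac{1}{Z(\beta)}\, Z(\beta \circ W) \tag{3.93} \\
\Rightarrow \quad p(\mathcal{D}) &= \frac{Z(\beta \circ W)}{Z(\beta)}, \tag{3.94}
\end{align}

where $\circ$ is an element-wise (Hadamard) product. Each $\beta_{S,v}$ is multiplied by $W_{S,v}$. This
yields the posterior,

\[ p(E \mid \mathcal{D}) = \frac{1}{Z(\beta \circ W)} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v}\, W_{\mathrm{pa}(v),v}. \tag{3.95} \]

Thus, the prior is conjugate and the posterior is obtained by updating the hyperparameters
$\beta$ via multiplication by the data evidence weights $W$. □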
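The conjugate update in Equation 3.95 can be checked numerically on a small problem. The sketch below is an illustration only and makes one simplifying assumption: every series may choose its parent set independently of the others, so the partition function is a product over series of sums over parent sets. The helper names (`parent_sets`, `partition`, `posterior`, `structure_prob`, `all_structures`) are hypothetical, and the brute-force enumeration is only practical for a handful of series.

```python
from itertools import combinations, product


def parent_sets(n, v):
    """All subsets of the other n-1 series that may act as parents of v."""
    others = [u for u in range(n) if u != v]
    return [frozenset(c) for k in range(n) for c in combinations(others, k)]


def partition(weights):
    """Z(weights) when parent sets are chosen independently for each series.

    weights[v] maps a parent set S to its weight for series v.
    """
    z = 1.0
    for per_series in weights:
        z *= sum(per_series.values())
    return z


def posterior(beta, W):
    """Conjugate update: the posterior hyperparameters are beta o W."""
    return [{S: b[S] * w[S] for S in b} for b, w in zip(beta, W)]


def structure_prob(weights, E):
    """Probability of structure E, given as one parent set per series."""
    p = 1.0
    for v, S in enumerate(E):
        p *= weights[v][S]
    return p / partition(weights)


def all_structures(n):
    """Brute-force list of every structure (for checking small cases only)."""
    return list(product(*[parent_sets(n, v) for v in range(n)]))
```

Summing `structure_prob(posterior(beta, W), E)` over `all_structures(n)` should give 1, and `partition(posterior(beta, W)) / partition(beta)` should match the evidence obtained by summing prior times likelihood over every structure.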


Structure  Event  Probabilities  and  Expectations 

The ability to compute the partition function and conjugacy of the prior enables exact
computation of a wide variety of useful prior and/or posterior event probabilities. Here,
we present some examples in terms of the prior distribution on structure. However,
conjugacy allows one to convert these to posterior events by substituting $\beta \circ W$ in place
of $\beta$.

The probability of a particular edge being present is:

\[ p(I_{u \to v} = 1) = E[I_{u \to v}] = 1 - \frac{Z(\beta^{-(u \to v)})}{Z(\beta)}, \tag{3.96} \]

where $I_{u \to v}$ is an indicator variable that has value 1 when the edge $u \to v$ is present.
$\beta^{-(u \to v)}$ is $\beta$ with all elements involving the edge from $u$ to $v$ set to zero. In words, Equation
3.96 calculates the probability of an edge as 1 minus the probability of that edge not
appearing. Interpreting the partition function as a weighted count of possible structures,
the probability of an edge not appearing is the ratio of a weighted count of structures
which do not include that edge to the weighted count of all structures.

Using the same approach one can calculate the joint edge appearance probability of
one set of edges conditioned on another set. Similarly, one can calculate the probability
that a time-series has no parents (is a root) or no children (is a leaf):

\[ p(v \text{ is a root}) = \frac{Z(\beta^{-\mathrm{in}(v)})}{Z(\beta)} \tag{3.97} \]

\[ p(v \text{ is a leaf}) = \frac{Z(\beta^{-\mathrm{out}(v)})}{Z(\beta)} \tag{3.98} \]

where $\mathrm{in}(v)$ and $\mathrm{out}(v)$ return the set of all edges into and out of time-series $v$, respectively.
$\beta^{-e}$ indicates that all elements of $\beta$ which involve any edge in the set $e$ are set to zero.
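As a rough illustration of Equations 3.96 through 3.98, the sketch below computes edge, root, and leaf probabilities as ratios of partition functions. It assumes, as a simplification, that parent sets are chosen independently for each series, so that the partition function factorizes; for trees and forests a different partition function would be needed. All function names are hypothetical.

```python
def partition(weights):
    """Z(weights) for independently chosen parent sets.

    weights[v] maps a frozenset parent set S to its weight for series v.
    """
    z = 1.0
    for per_series in weights:
        z *= sum(per_series.values())
    return z


def remove_edges(weights, edges):
    """Copy of the weights with every entry that uses an edge (u, v) zeroed."""
    out = [dict(w) for w in weights]
    for u, v in edges:
        for S in out[v]:
            if u in S:
                out[v][S] = 0.0
    return out


def edge_probability(weights, u, v):
    """Equation 3.96: one minus the probability that u -> v is absent."""
    return 1.0 - partition(remove_edges(weights, [(u, v)])) / partition(weights)


def root_probability(weights, v):
    """Equation 3.97: probability that series v has no parents."""
    incoming = [(u, v) for u in range(len(weights)) if u != v]
    return partition(remove_edges(weights, incoming)) / partition(weights)


def leaf_probability(weights, v):
    """Equation 3.98: probability that series v has no children."""
    outgoing = [(v, u) for u in range(len(weights)) if u != v]
    return partition(remove_edges(weights, outgoing)) / partition(weights)
```

Passing the prior weights gives prior event probabilities; passing the element-wise product of prior weights and evidence terms gives the corresponding posterior probabilities.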

The indicator variables used in the examples above can be expressed as general
multiplicative functions of the form

\[ g(E) = \prod_{v=1}^{N} g_{\mathrm{pa}(v),v}. \tag{3.99} \]



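The expected value of a general multiplicative function can be calculated by:

\begin{align}
E[g(E)] &= \sum_{E} p_0(E)\, g(E) \tag{3.100} \\
&= \frac{1}{Z(\beta)} \sum_{E} \prod_{i=1}^{N} \beta_{\mathrm{pa}(i),i} \prod_{j=1}^{N} g_{\mathrm{pa}(j),j} \tag{3.101} \\
&= \frac{1}{Z(\beta)} \sum_{E} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v),v}\, g_{\mathrm{pa}(v),v} \tag{3.102} \\
&= \frac{Z(\beta \circ g)}{Z(\beta)}. \tag{3.103}
\end{align}

Note that variance or other higher order moments of multiplicative functions can also
be calculated in this manner (e.g. using $Z(\beta \circ g^2)$ in calculating posterior variance).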

In addition, one can calculate the expectation of additive functions of the form

\[ f(E) = \sum_{v=1}^{N} f_{\mathrm{pa}(v),v}. \tag{3.104} \]

Additive functions allow calculation of quantities such as the expected number of children
or parents of a particular time-series. For example, by setting $f_{S,v}$ to $|S|$ for a
single $v$ and all other $f_{S,u \neq v} = 0$, $f(E)$ will count the number of parents vertex $v$ has
in structure $E$.

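For $E \in \mathcal{A}_N$ or $E \in \mathcal{B}_M$ the expectation takes the form:

\[ E[f(E)] = \sum_{v=1}^{N} \frac{\sum_{S} \beta_{S,v}\, f_{S,v}}{\sum_{S} \beta_{S,v}}, \tag{3.105} \]

where the sums over $S$ range over all valid parent sets for time-series $v$.
For directed trees, $E \in \mathcal{T}$,

\[ E[f(E)] = \mathrm{tr}\left( M_{r,r}(Q(\beta \circ f))\, M_{r,r}(Q(\beta))^{-1} \right), \tag{3.106} \]

where $M_{i,j}(M)$ is the matrix $M$ with its $i$th row and $j$th column removed. A similar
form is obtained for directed forests. Derivations of Equations 3.105 and 3.106 can be
found in Appendix B.3.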

Maximum  a  Posteriori  Structure 

If a point estimate of structure is desired, using prior information, one can turn to a
MAP estimate. It is easy to see from Equation 3.95 that the MAP directed structure
$E^*$ is obtained via

\[ E^* = \arg\max_{E} \prod_{v=1}^{N} \beta_{\mathrm{pa}(v,E),v}\, W_{\mathrm{pa}(v,E),v}. \tag{3.107} \]

When $E \in \mathcal{A}_N$ or $E \in \mathcal{B}_M$ each vertex $v$ can be treated independently. That is, $E^*$ can
be found by $N$ independent optimizations, each of which finds the maximum $\beta_{S,v} W_{S,v}$
over all parent sets $S$ for time-series $v$.
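When each vertex can be treated independently, the MAP search reduces to one argmax per time-series. A minimal sketch, assuming the weights are stored as one dictionary per series that maps a parent set to its value (the name `map_structure` and this data layout are hypothetical):

```python
def map_structure(beta, W):
    """Per-series MAP parent sets under independently chosen parent sets.

    beta[v] and W[v] map each valid parent set S of series v to the prior
    weight beta_{S,v} and the evidence W_{S,v}, respectively.
    """
    best = []
    for beta_v, W_v in zip(beta, W):
        # N independent optimizations: maximize beta_{S,v} * W_{S,v} over S.
        best.append(max(beta_v, key=lambda S: beta_v[S] * W_v[S]))
    return best
```

In practice the evidence terms are often kept in the log domain, in which case the product inside the `max` becomes a sum. This sketch does not apply to directed trees or forests, where the acyclicity constraint couples the choices.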

For directed forests and trees the optimization becomes more complex due to the
global acyclic constraint on $E$. It is equivalent to the maximum weighted directed tree
problem. Solutions to this problem were developed independently by Chu and Liu [18],
Edmonds [25] and Bock [10]. Appendix A.3.1 gives the details of their algorithm.

Algorithm  for  Bayesian  Inference  over  Structure 

In the previous sections we described how to perform exact Bayesian inference on the
structure $E$ given observed data $\mathcal{D}$ using a TIM. Algorithm 3.4.1 summarizes the necessary
steps when analyzing $N$ time-series.

It is important to note that each evidence term $W_{S,v}$ in Equation 3.89 is calculated
using the same data $\mathcal{D}$. In Section 3.3.2, we discussed how the ability to exploit
parametric differences is lost when point estimates of parameters are obtained from a
single $\mathcal{D}$. Additionally, issues arose when comparing increasingly expressive structures
using these point estimates of parameters. Equation 3.89 avoids some of these issues
by integrating over all parameters rather than obtaining point estimates. In particular,
the integral has a built-in penalty for larger parent sets which would result in more
Algorithm 3.4.1 Bayesian Inference of TIM Structure for Static Dependence Analysis
Require: Observations of N time-series in $\mathcal{D}$.
  % Define a TIM
  Choose r
  Choose the set in which E belongs (e.g. $E \in \mathcal{A}_N$).
  Choose a parameterization for each $p(x_t^v \mid \tilde{x}_t^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)})$.
  Choose a conjugate prior $p_0(\Theta_{v|\mathrm{pa}(v)} \mid \Upsilon)$ for each conditional.
  Use these prior terms to form $p_0(\Theta \mid E)$ using Equation 3.63.
  Choose $\beta$ and form $p_0(E)$ using Equation 3.64.
  % Get the posterior
  for all $v \in \{1, \ldots, N\}$ do
    for all valid parent sets S do
      Calculate $W_{S,v}$ using Equation 3.89
    end for
  end for
  Use Equation 3.95 to form the posterior $p(E \mid \mathcal{D})$
  % Output desired form of result
  if a point estimate is desired then
    Calculate and report the MAP estimate of E
  else if posterior structural events, expectations and/or marginal probabilities are desired then
    Report them using Equations 3.96 through 3.106
  else
    Report the full posterior $p(E \mid \mathcal{D})$
  end if


expressive structures. That is, the integral is over the full space of parameters given
a parent set $S$. More elements in $S$ result in a larger space of parameters to integrate
over. The posterior over this space must integrate to 1 and thus each unnormalized posterior
term in the integral contributes less to the evidence as $|S|$ increases. However, it
remains the case that even this Bayesian approach loses something by using the same
data $\mathcal{D}$ to calculate evidence for all substructures. Knowing the true parameters which
generated the data will always yield a better result. That is, if the true underlying
parameters, $\Theta_{v|S}$, were known we could simply use $W_{S,v} = p(\mathcal{D}^v \mid \tilde{\mathcal{D}}^{v,S}, \Theta_{v|S})$. It is important
to note that in the Bayesian context we do not assume the existence of a “true set”
of parameters, but instead a distribution over them, with each new draw of $\mathcal{D}$ potentially
using a different $\Theta$ drawn from this distribution.

■  3.5  Summary 

In  this  chapter  we  introduced  the  general  class  of  static  dependence  models  which  are 
specified  by  fixed  dependence  structure  and  associated  parameters.  We  discussed  the 
main  challenges  associated  with  performing  structural  inference  on  this  class  of  models, 
particularly when the parameters are unknown and treated as a nuisance. Namely, these
are the challenges of specifying priors on a large set of structures and of integrating over
unknown parameters. Two different specific static dependence models were introduced. The
challenges  associated  with  inference  were  addressed  in  a  different  way  for  each  model. 

A  FactM  describes  time-series  in  terms  of  independent  groupings.  We  will  use 
such  models  for  reasoning  over  a  finite  set  of  associations,  thus  keeping  the  space  of 
structures  tractable.  In  the  absence  of  prior  information  we  presented  an  ML  approach 
to  structural  inference  and  showed  how  it  is  an  approximation  to  the  MAP  solution 
which  avoids  integrating  over  unknown  parameters.  An  alternative  hypothesis  testing 
view  of  ML  inference  was  presented  which  exposes  the  role  of  structure  and  parametric 
differences  when  choosing  among  FactMs.  This  analysis  allowed  us  to  characterize  what 
is  lost  when  performing  ML  inference.  That  is,  it  was  shown  that  parametric  differences 
cannot  be  exploited. 

A  TIM  uses  a  directed  structure  to  specify  more  detailed  causal  relationships  among 
time-series.  A  conjugate  prior  on  structure  and  parameters  was  presented  which  allowed 
for  exact  Bayesian  inference  using  a  TIM.  This  prior  allows  one  to  reason  over  a  set  of 
structures  which  is  super-exponential  in  the  number  of  time-series  in  exponential-time, 
in  general.  Furthermore,  it  was  shown  that  by  imposing  simple  local  or  global  structural 
constraints,  computation  was  reduced  to  polynomial-time  complexity.  The  ability  to 
calculate  the  exact  posterior  further  allows  one  to  calculate  exact  marginal  posterior 
event  probabilities  and  statistics. 

Note that, while we presented two different approaches for inference for these two
models, it is straightforward to define appropriate priors and perform exact Bayesian
inference using a FactM, and one can easily perform ML estimation on a TIM. The
specific inference technique we chose to present for each model is a direct consequence
of the applications in which these models will be used in Chapters 5 and 6.
Chapter 4

Dynamic Dependence Models for Time-Series


In  Chapter  3,  we  presented  two  static  dependence  models  and  examined  how  they  could 
be  used  to  analyze  the  dependence  relationships  among  multiple  time-series.  These 
models  assumed  that  the  dependence  relationship  among  time-series  was  fixed  over 
all  time.  In  this  chapter,  we  remove  the  static  structure  assumption  and  look  at  the 
problem  of  dynamic  dependence  analysis.  That  is,  given  multiple  time-series  we  wish  to 
analyze  the  way  in  which  their  dependence  structure  evolves  over  time.  As  we  will  see, 
incorporating  the  notion  of  dynamically  evolving  dependence  structures  complicates 
inference. In the dynamic framework, we consider both point estimates and full
characterizations of posterior uncertainty on structure over time.

As discussed in Chapter 3, the number of possible structures used in a static dependence
model generally grows super-exponentially with the number of time-series being
analyzed. An obvious challenge for a dynamic dependence analysis tool is that, by allowing
the structure to change over time, the number of possible sequences of structure
increases exponentially as the number of time points grows. That is, if one considers $S$
allowable structures at each time point, there are $S^T$ possible sequences of structure
over a period of $T$ time points.

In  this  chapter,  we  discuss  ways  to  efficiently  reason  over  these  sequences  of  changing 
dependence  relationships.  We  assume  dependence  relationships  have  some  temporal 
persistence  and  that  they  may  be  revisited  over  time.  These  two  assumptions  are  useful 
in  that  they  simplify  aspects  of  inference  and  provide  benefits  in  terms  of  estimation 
quality.  Two  applications  are  explored  in  Chapters  5  and  6  in  which  both  assumptions 
are reasonable. For example, in an audio-visual speech association task it is common to
assume that individuals produce continuous speech segments rather than short bursts
of sound. It is also expected that each person will speak multiple times during a
conversation.

We begin in Section 4.1 with a discussion of standard windowed approaches for dynamic
dependence analysis. Next, we introduce the concept of a dynamic dependence
model in Section 4.2. These models extend the static dependence models presented in
Chapter 3 via the introduction of a dynamically evolving latent state variable. The
state variable indexes structure, allowing switching among a finite set of dependence
relationships over time. In Section 4.3, we detail inference using a dynamic dependence
model in a maximum likelihood setting in which a point estimate of the sequence of
active structures is desired. Following a similar analysis to that presented in Section
3.3.2 we show that, in contrast to static dependence models, dynamic dependence models
allow one to take advantage of parametric differences to help identify structure. A
set of illustrative examples is provided in Section 4.3.2. In Section 4.4 we introduce a
tractable prior for dynamic dependence models and discuss Bayesian inference of structure
sequences. Lastly, in Section 4.5 we discuss our model in the context of related
work.

■  4.1  Windowed  Approaches 

Using our assumption that dependence relationships do not switch rapidly, one approach
for performing dynamic dependence analysis is to treat the problem as a series of static
dependence inference tasks. That is, one can start by assuming the structure is fixed
within a window of time $t_w = [t - w/2, \ldots, t, \ldots, t + w/2]$. Within this window, static
dependence analysis as described in Chapter 3 can be performed on $\mathcal{D}_{t_w}$. In order to deal
with the structure changing over time, this window is moved forward to a future time
point $t + \delta$ and analysis is repeated. Such an approach treats each window independently,
transforming the inference problem from one of inference over an exponential number
of sequences of structure to one which is linear in the number of windows.
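The windowed procedure can be sketched as a loop that applies any static analysis to each window in turn. This is an illustration only: it indexes windows by their start sample rather than their center, and `analyze` is a placeholder for whichever static dependence analysis from Chapter 3 is used. The names `window_starts` and `windowed_analysis` are hypothetical.

```python
def window_starts(T, w, delta):
    """Start indices of length-w windows advanced by delta over T samples."""
    return list(range(0, T - w + 1, delta))


def windowed_analysis(data, w, delta, analyze):
    """Apply a static analysis independently to each window of the data.

    data    : sequence of T observations (one entry per time point)
    analyze : callable taking the observations in one window and returning
              the result of a static dependence analysis on that window
    """
    results = []
    for start in window_starts(len(data), w, delta):
        # Each window is treated in isolation; nothing is shared across windows.
        results.append((start, analyze(data[start:start + w])))
    return results
```

The number of calls to `analyze` grows linearly with the number of windows, which is the source of the tractability described above, and also the reason no information is pooled across windows that share a structure.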

There are two main issues that one must consider when using a windowed approach.
First is the open question of what window size to pick. There is a bias-variance tradeoff
as a function of window size. That is, long windows bias the result since they are more
likely to contain times in which structure changes. Within shorter windows structure is
more likely to be stationary, but as a consequence of using fewer samples the estimate
of this structure is less accurate (has higher variance). In fact, there is no fixed window
size that is guaranteed to never violate the static dependence assumption. A fixed
length window will eventually overlap a change of structure as it is moved forward over
time.

A second issue with windowed approaches is that they inherit the challenges associated
with structural inference for static dependence models due to their myopic
windowed view of data. That is, the inference performed for a particular window only
uses information from the data observed within that window. By definition, a windowed
approach cannot take advantage of information from data observed in the past
or future which may share the same structure and/or parameters as the data observed
within the window analyzed. As discussed in Section 3.3.2, parametric differences can
improve structural inference due to increased separability of hypothesized models. In
a windowed approach, parameters for different structures are estimated or integrated
over using the same data for each hypothesized structure. Thus, such an approach
cannot uncover or exploit the true underlying parametric differences. In addition, if
a maximum likelihood approach is utilized, significance must also be estimated when
reasoning over nested hypothesized structures. That is, one can always increase the
likelihood by using a more complex model and thus one must turn to estimation of
significance when deciding if a less complex model should be rejected. These issues are
explored and discussed using simple illustrative experiments in Section 4.3.2.

■  4.2  Dynamic  Dependence  Models 

In this section, we present the concept of a dynamic dependence model (DDM). A DDM
explicitly models evolving dependence structure among time-series. We will show that
dynamic dependence analysis can be mapped to inference using this model rather than
to a series of independent static dependence inference tasks. A DDM can be represented
as a dynamic Bayesian network as depicted in Figure 4.1. A hidden discrete state at
time $t$, $z_t$, indexes one of $K$ specific structures, $E^{z_t}$, and parameters, $\Theta^{z_t}$. The observed
time-series at time $t$ are modeled using a static dependence model, as presented in
Chapter 3, with structure specified by $E^{z_t}$ and parameters specified by $\Theta^{z_t}$.

Figure 4.1. First Order Dynamic Dependence Model: A graphical model representing a generic first
order ($r = 1$) dynamic dependence model is shown. The square box around the structure $E^k$ and
parameters $\Theta^k$ is a plate indicating there are $K$ independent copies of these parameters and structure.
$\pi$ is the set of parameters describing the transition probabilities for the evolving latent state $z$.

Assuming $r$-th order temporal dependence and $K$ possible states, the generative model for a
DDM($r$,$K$) over a time period $\mathbf{t} = \{1, \ldots, T\}$ has the following form:

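\begin{align}
p(x_{\mathbf{t}}, z_{\mathbf{t}} \mid \mathbf{E}, \Theta) &= p(x_{\mathbf{t}} \mid z_{\mathbf{t}}, \mathbf{E}, \Theta)\, p(z_{\mathbf{t}} \mid \Theta) \tag{4.1} \\
&= \prod_{t=1}^{T} p(x_t \mid \tilde{x}_t, E^{z_t}, \Theta^{z_t})\, p(z_t \mid \pi^{z_{t-1}}), \tag{4.2}
\end{align}

where $z_0 = 0$, $\mathbf{E} = \{E^1, \ldots, E^K\}$ is a set of structures, and $\Theta = \{\pi^0, \ldots, \pi^K, \Theta^1, \ldots, \Theta^K\}$
is a set of all parameters for the model. The state sequence, $z_{1:T}$, is modeled as a first
order Markov process with a discrete transition distribution,

\begin{align}
p(z_t \mid z_{t-1}, \Theta) &= p(z_t \mid \pi^{z_{t-1}}) \tag{4.3} \\
&= \pi^{z_{t-1}}_{z_t}. \tag{4.4}
\end{align}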

The parameters $\pi^j = \{\pi^j_1, \ldots, \pi^j_K\}$ are the transition probabilities to all $K$ states from
state $j$. Each state $k$ indexes an $r$-th order static dependence model which defines
the form of $p(x_t \mid \tilde{x}_t, E^k, \Theta^k)$. If a FactM is used with the structure specified in
terms of a factorization/hyperedges $F$, we refer to the DDM as a hidden factorization
Markov model, HFactMM($r$,$K$) [81]. If a TIM is used with structure specified in terms
of directed edges $E$, we refer to the DDM as a switching temporal interaction model,
STIM($r$,$K$) [82].
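The generative process in Equations 4.1 through 4.4 can be sketched as a short sampler. This is an illustration under assumptions that are not part of the model definition: states are indexed from 0 rather than from 1, the initial row `pi0` plays the role of $\pi^0$, and the toy `two_series_step` model (two scalar series with made-up coefficients, first order) stands in for a real static dependence model. All names are hypothetical.

```python
import random


def sample_ddm(T, pi0, pi, step, seed=0):
    """Draw a state sequence and observations from a first-order DDM.

    pi0[k]   : probability of starting in state k
    pi[j][k] : probability of moving from state j to state k
    step     : step(k, x_prev, rng) -> x_t, the static dependence model
               (structure and parameters) indexed by state k
    """
    rng = random.Random(seed)
    z, x = [], []
    x_prev, row = None, pi0
    for _ in range(T):
        k = rng.choices(range(len(row)), weights=row)[0]  # z_t ~ pi^{z_{t-1}}
        x_t = step(k, x_prev, rng)                        # x_t ~ p(. | past, E^k, Theta^k)
        z.append(k)
        x.append(x_t)
        x_prev, row = x_t, pi[k]
    return z, x


def two_series_step(k, x_prev, rng):
    """Toy model: in state 0 the series are independent; in state 1 the
    second series is driven by the past of the first."""
    a_prev, b_prev = x_prev if x_prev is not None else (0.0, 0.0)
    a = 0.5 * a_prev + rng.gauss(0.0, 1.0)
    b = (0.9 * a_prev if k == 1 else 0.5 * b_prev) + rng.gauss(0.0, 1.0)
    return (a, b)
```

For example, `sample_ddm(200, [0.5, 0.5], [[0.95, 0.05], [0.05, 0.95]], two_series_step)` produces a sequence in which each dependence structure persists for a while and is revisited, matching the two assumptions made at the start of this chapter.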

Note that a DDM assumes that the number of states, $K$, is known and that structures
(and parameters) can be revisited over time. We note that the static dependence
models presented in Chapter 3 can also easily be embedded into other alternative dynamic
models. For example, one may use a product partition model (PPM) similar to that
used by Xuan and Murphy [94]. A PPM allows for an unknown number of states that
are never revisited. Similarly, one may adopt nonparametric Bayesian models such as
the hierarchical Dirichlet process hidden Markov model (HDP-HMM) [87] to allow for
an unknown number of potentially revisited states. The appropriate choice of model is
highly dependent on how well the underlying assumptions of the model match the
application of interest. While there are interesting details within each of these modeling
choices, here we focus on dynamic estimation when the number of states is known and each
state is likely to be revisited. That is, the applications presented in this dissertation
focus on the use of DDMs, leaving the above straightforward modifications/extensions
for future work.

■  4.2.1  Dynamic  Dependence  Analysis  Using  a  DDM 

We now turn to the core task of dynamic dependence analysis. Given observations $\mathcal{D}$
of multiple time-series over a period of time $\mathbf{t} = \{1, \ldots, T\}$, the goal of dynamic
dependence analysis is to characterize the dependence among these time-series at each
point in time $t \in \mathbf{t}$. Dynamic dependence analysis can be carried out using a DDM.
That is, using a DDM we are interested in inferring information about $z_{1:T}$ and $\mathbf{E}$ given
observed data. The set $\mathbf{E}$ contains the dependence structure for each of the $K$ states
and $z_{1:T}$ indicates which state is active at each point in time. If prior information is
available, ideally, one would like to calculate the posterior

\[ p(z_{1:T}, \mathbf{E} \mid \mathcal{D}). \tag{4.5} \]

From this joint posterior on the set of structures $\mathbf{E}$ and state sequence $z_{1:T}$, a wide
variety of useful statistics can be calculated. For example, the MAP state sequence and
structures can be obtained. Unfortunately, no simple analytic form exists for Equation
4.5 and brute force calculation is intractable due to the size of the joint space. There
are $K^T$ possible state sequences, $z_{1:T}$, and potentially a super-exponential number of
structures for each of the $K$ elements, $E^k$, in $\mathbf{E}$.

Fortunately, however, DDMs have a well studied structure. They are nothing more
than specialized hidden Markov models (HMMs) or switching vector autoregressive
(SVAR) models. They differ from these standard models in that the form of the dependence
among the observations is controlled by the values of the hidden structure
$\mathbf{E}$.$^{1}$ In the following two sections we discuss both maximum likelihood and Bayesian
approaches to structural inference using a DDM. Both approaches will take advantage
of two key properties of the model:

1. The likelihood $p(\mathcal{D} \mid \mathbf{E}, \Theta)$ as well as the posterior on each state $z_t$ given the
data and all other unobserved random variables, $p(z_t \mid \mathcal{D}, \mathbf{E}, \Theta)$, can be calculated
tractably using standard forward-backward message passing [4].

2. Given the state sequence $z_{1:T}$ one can pool information from time points that share
a common state $k$ in order to help one infer the structure $E^k$ and parameters $\Theta^k$.
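The first property can be illustrated with a compact forward-backward pass. The sketch below is a generic scaled HMM recursion rather than a transcription of Algorithm 4.3.2: it assumes the per-state observation likelihoods have already been evaluated and stored in `lik`, indexes states from 0, and uses the hypothetical names `forward_backward`, `pi0` (for $\pi^0$) and `pi`.

```python
from math import log


def forward_backward(lik, pi0, pi):
    """State posteriors and log-likelihood for a discrete-state Markov model.

    lik[t][k] : p(x_t | past observations, E^k, Theta^k), precomputed
    pi0[k]    : initial state probabilities
    pi[j][k]  : transition probability from state j to state k
    Returns (gamma, loglik) with gamma[t][k] = p(z_t = k | all data).
    """
    T, K = len(lik), len(pi0)

    # Forward pass, rescaled at every step to avoid numerical underflow.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [pi0[k] * lik[0][k] for k in range(K)]
        else:
            a = [lik[t][k] * sum(alpha[t - 1][j] * pi[j][k] for j in range(K))
                 for k in range(K)]
        c = sum(a)
        scale.append(c)
        alpha.append([val / c for val in a])

    # Backward pass using the same scale factors.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(pi[k][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(K))
                   / scale[t + 1] for k in range(K)]

    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    return gamma, sum(log(c) for c in scale)
```

The cost is proportional to $T K^2$, in contrast to the $K^T$ terms of a brute-force sum over state sequences. The returned `gamma` values are the weights that allow observations sharing a state to be pooled, which is the second property.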

■  4.3  Maximum  Likelihood  Inference 

In the absence of prior information, one can adopt a classical maximum likelihood
approach for estimating the unobserved structures $\mathbf{E}$ and parameters $\Theta$. That is,
similar to the approach presented in Section 3.3.2, we wish to find

\begin{align}
\{\hat{\mathbf{E}}, \hat{\Theta}\} &= \arg\max_{\mathbf{E},\Theta}\; p(\mathcal{D} \mid \mathbf{E}, \Theta) \tag{4.6} \\
&= \arg\max_{\mathbf{E},\Theta} \sum_{z_{1:T}} p(\mathcal{D} \mid z_{1:T}, \mathbf{E}, \Theta)\, p(z_{1:T} \mid \Theta). \tag{4.7}
\end{align}

Given an estimate of the structures $\hat{\mathbf{E}}$ and parameters $\hat{\Theta}$, one can then find the
MAP state sequence:

\[ \hat{z}_{1:T} = \arg\max_{z_{1:T}}\; p(z_{1:T} \mid \mathcal{D}, \hat{\mathbf{E}}, \hat{\Theta}). \tag{4.8} \]

Again, as discussed in Section 3.3.2, such an approach is related to generalized likelihood
methods for dealing with nuisance parameters. We will discuss the details of the
optimization in Equation 4.7 in the following section along with an analysis of how state
sequences distinguish themselves from each other when calculating Equation 4.8.

$^{1}$ We will put our model in the context of previous work in Section 4.5.
■ 4.3.1 Expectation Maximization

The optimization in Equation 4.7 is complicated by the fact that one must sum over
all $K^T$ possible state sequences and no closed form analytical solution exists. In this
dissertation, we use the standard approach of Expectation Maximization (EM) to find
a local maximum of $p(\mathcal{D} \mid \mathbf{E}, \Theta)$ [23]. Given an initial estimate of the structure and
parameters, $\mathbf{E}^{(0)}$ and $\Theta^{(0)}$, the EM algorithm iteratively updates its estimate by repeating
two basic steps. For iteration $i$, these two steps are:

1. E-Step: Find a function which is a tight lower bound for $p(\mathcal{D} \mid \mathbf{E}, \Theta)$ at the
current estimate $\mathbf{E}^{(i-1)}, \Theta^{(i-1)}$.

2. M-Step: Given this function, find a new set of structures and parameters, $\mathbf{E}^{(i)}, \Theta^{(i)}$,
which maximize it.

(cf. [65]). For a DDM (or general HMM or SVAR), the function which provides the
tight bound is simply the posterior $p(z_{1:T} \mid \mathcal{D}, \mathbf{E}^{(i-1)}, \Theta^{(i-1)})$. The M-Step is a series of
$K$ independent maximization problems each of the form:

\[ \{E^{k(i)}, \Theta^{k(i)}\} = \arg\max_{E^k, \Theta^k} \sum_{t=1}^{T} \gamma_t^k \log p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, E^k, \Theta^k), \tag{4.9} \]

where $\gamma_t^k$ was calculated in the E-Step using standard forward-backward message passing
[4, 76] such that

\[ \gamma_t^k = p(z_t = k \mid \mathcal{D}, \mathbf{E}^{(i-1)}, \Theta^{(i-1)}). \tag{4.10} \]


That  is,  the  M-Step  does  a  weighted  ML  estimate  of  structure  and  parameters  for  each 
state.  The  weights  are  a  function  of  the  posterior  probability  of  being  in  state  k  at  each 
point  in  time.  For  a  DDM,  the  weighted  ML  step  is  performed  on  a  static  dependence 
model.  For  example,  if  one  is  using  an  HFactMM,  the  optimization  in  Equation  4.9 
would  be  a  simple  modification  of  Equation  3.19  in  Section  3.3.2.  That  is,  for  each 
state  k,  the  M-Step  is: 


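\begin{align}
\{\hat{F}^k, \hat{\Theta}^k\} &= \arg\max_{F,\Theta} \sum_{t=1}^{T} \gamma_t^k \sum_{f=1}^{|F|} \log p(\mathcal{D}_t^{F_f} \mid \tilde{\mathcal{D}}_t^{F_f}, \Theta_{F_f}) \tag{4.11} \\
&= \arg\max_{F} \sum_{f=1}^{|F|} \max_{\Theta_{F_f}} \sum_{t=1}^{T} \gamma_t^k \log p(\mathcal{D}_t^{F_f} \mid \tilde{\mathcal{D}}_t^{F_f}, \Theta_{F_f}) \tag{4.12}
\end{align}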
where we have switched to the notation $F$ to indicate we are specifically discussing ML
inference on an HFactMM here. This optimization can be expressed as two steps. First,
one must find the maximum weighted likelihood for each factor:

\begin{align}
\hat{\Theta}_{F_f} &= \arg\max_{\Theta_{F_f}} \sum_{t=1}^{T} \gamma_t^k \log p(\mathcal{D}_t^{F_f} \mid \tilde{\mathcal{D}}_t^{F_f}, \Theta_{F_f}) \tag{4.13} \\
&= \arg\max_{\Theta_{F_f}} W^k_{F_f}(\Theta_{F_f}). \tag{4.14}
\end{align}

For Gaussian distributions the parameters $\hat{\Theta}_{F_f}$ will be weighted estimates of the mean
and covariance, while for discrete distributions weighted counts will be used for estimating
symbol probabilities. Given $W^k_{F_f}(\hat{\Theta}_{F_f})$ for every allowable factor $F_f$, Equation
4.12 becomes

\[ \{\hat{F}^k, \hat{\Theta}^k\} = \arg\max_{F,\Theta} \sum_{f=1}^{|F|} W^k_{F_f}(\Theta_{F_f}). \tag{4.15} \]

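For a Gaussian factor, the weighted ML step has a closed form. The sketch below shows only the simplest case and rests on assumptions: zero-order dependence ($r = 0$), so there is no conditioning on past samples, and weights supplied directly as a list (they would be the $\gamma_t^k$ values for one state $k$). The function name `weighted_gaussian_ml` is hypothetical.

```python
def weighted_gaussian_ml(samples, weights):
    """Weighted ML estimates of a Gaussian mean vector and covariance matrix.

    samples : list of T observation vectors (lists of equal length d)
    weights : list of T nonnegative weights, e.g. the posterior probability
              of the state of interest at each time point
    """
    total = sum(weights)
    d = len(samples[0])
    mean = [sum(w * x[i] for w, x in zip(weights, samples)) / total
            for i in range(d)]
    cov = [[sum(w * (x[i] - mean[i]) * (x[j] - mean[j])
                for w, x in zip(weights, samples)) / total
            for j in range(d)]
           for i in range(d)]
    return mean, cov
```

Time points with a weight near zero contribute almost nothing, so each state's parameters are effectively estimated from the samples assigned to that state. Restricting the vectors to the series in one factor gives the per-factor estimate.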

In addition to estimating structure and parameters $\Theta$, the M-Step also updates the
transition distribution parameters $\pi^0, \ldots, \pi^K$. The E-step and M-step are iterated until
convergence. Convergence can be defined in terms of stable parameter estimates or
overall likelihood of the data given the current parameter estimates. A summary of the
EM algorithm for a DDM is given in Algorithm 4.3.1 and forward-backward message
passing is outlined in Algorithm 4.3.2. In practice we run multiple EM optimizations,
each starting with a different random initialization. The parameters with the highest
likelihood over all initializations are taken as the final result. We refer the reader to [76]
for a basic tutorial on EM for HMMs.

Analysis  of  Parametric  Differences 

The MAP state sequence in Equation 4.8 can be calculated efficiently via dynamic
programming using the Viterbi algorithm [92, 31]. Viterbi decoding implicitly performs
an $M$-ary hypothesis test comparing all $M = K^T$ possible state sequences. Much like the
analysis shown in Section 3.3.2, this alternative hypothesis testing view helps expose
how state sequences distinguish themselves from each other.
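A minimal sketch of Viterbi decoding in the log domain is given here for concreteness. It is a generic textbook formulation, not a transcription of any algorithm in this dissertation: it assumes precomputed per-state observation likelihoods in `lik`, zero-based state indices, and the hypothetical names `viterbi`, `pi0` and `pi`.

```python
from math import log


def viterbi(lik, pi0, pi):
    """Most probable state sequence given per-state observation likelihoods.

    lik[t][k] : likelihood of the observation at time t under state k
    pi0[k]    : initial state probabilities
    pi[j][k]  : transition probability from state j to state k
    """
    def lg(p):
        return log(p) if p > 0 else float("-inf")

    T, K = len(lik), len(pi0)
    score = [lg(pi0[k]) + lg(lik[0][k]) for k in range(K)]
    back = []
    for t in range(1, T):
        prev, score, ptr = score, [], []
        for k in range(K):
            # Best predecessor of state k at time t.
            j = max(range(K), key=lambda j: prev[j] + lg(pi[j][k]))
            ptr.append(j)
            score.append(prev[j] + lg(pi[j][k]) + lg(lik[t][k]))
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    k = max(range(K), key=lambda k: score[k])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]
```

Each step keeps only the best-scoring path into every state, so the comparison among all $K^T$ sequences is carried out with work proportional to $T K^2$.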

We examine how state sequences drawn from an HFactMM(0,$K$) with corresponding
estimates of structures $\hat{F}$ and parameters $\hat{\Theta}$ distinguish themselves from each other. We
examine the case in which $r = 0$ here for simplicity and to match our experiments
in Chapter 5. Consider a binary hypothesis test between two different state sequences
in which

$$H_1 : z_{1:T} = a_{1:T} \tag{4.16}$$

$$H_2 : z_{1:T} = b_{1:T} \tag{4.17}$$

We define a common factorization, $F^{\cap_t}$, in a similar manner to that shown in Section
3.3.2. That is, $F^{\cap_t}$ is the factorization common to the factorizations $F^{a_t}$ and $F^{b_t}$.

Algorithm 4.3.1 The EM Algorithm for a Dynamic Dependence Model
Require: $N$ observed time-series from $1:T$ in $\mathcal{D}$, and specific parameterization/choice
of a DDM.
  % Initialize
  Randomly set $E^{(0)}$ and $\Theta^{(0)}$
  $i \leftarrow 0$
  % Expectation Maximization
  repeat
    $i \leftarrow i + 1$
    % E-Step
    $\{\gamma, \xi\} \leftarrow$ ForwardBackward($\mathcal{D}$, $E^{(i-1)}$, $\Theta^{(i-1)}$, $K$)  % See Algorithm 4.3.2
    % M-Step
    $\{\pi^0_1, \ldots, \pi^0_K\}^{(i)} \leftarrow \{\gamma^1_1, \ldots, \gamma^K_1\}$
    for $k = 1$ to $K$ do
      $\{E^{k(i)}, \Theta^{k(i)}\} \leftarrow \arg\max_{E,\Theta} \sum_t \gamma_t^k \log p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, E, \Theta)$
      for $j = 1$ to $K$ do
        $\pi^{k(i)}_j \leftarrow \sum_{t=1}^{T-1} \xi_t^{k,j} \,/\, \sum_{t=1}^{T-1} \gamma_t^k$
      end for
    end for
  until Convergence
  % Report Results
  Output $E^{(i)}$ and $\Theta^{(i)}$ along with a calculation of $p(\mathcal{D} \mid E^{(i)}, \Theta^{(i)})$.
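Algorithm 4.3.2 The Forward-Backward Algorithm
function ForwardBackward($\mathcal{D}$, $E$, $\Theta$, $K$)
  Let $p_t^k = p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, E^k, \Theta^k)$
  % Forward: $\alpha_t^k = p(\mathcal{D}_{1:t}, z_t = k \mid E, \Theta)$
  for $k = 1$ to $K$ do
    $\alpha_1^k \leftarrow p_1^k \pi^0_k$  % Initialize forward messages
  end for
  for $t = 1$ to $T - 1$ do
    for $k = 1$ to $K$ do
      $\alpha_{t+1}^k \leftarrow p_{t+1}^k \sum_{j=1}^{K} \alpha_t^j \pi^j_k$
    end for
  end for
  % Backward: $\beta_t^k = p(\mathcal{D}_{(t+1):T} \mid z_t = k, E, \Theta)$
  for $k = 1$ to $K$ do
    $\beta_T^k \leftarrow 1$  % Initialize backward messages
  end for
  for $t = T - 1$ down to 1 do
    for $k = 1$ to $K$ do
      $\beta_t^k \leftarrow \sum_{j=1}^{K} \pi^k_j \, p_{t+1}^j \, \beta_{t+1}^j$
    end for
  end for
  % Calc $\gamma_t^k = p(z_t = k \mid \mathcal{D}, E, \Theta)$ and $\xi_t^{k,j} = p(z_t = k, z_{t+1} = j \mid \mathcal{D}, E, \Theta)$
  for $t = 1$ to $T - 1$ do
    for $k = 1$ to $K$ do
      $\gamma_t^k \leftarrow \alpha_t^k \beta_t^k \,/\, \left(\sum_{j=1}^{K} \alpha_t^j \beta_t^j\right)$
      for $j = 1$ to $K$ do
        $\xi_t^{k,j} \leftarrow \left(\alpha_t^k \pi^k_j p_{t+1}^j \beta_{t+1}^j\right) / \left(\sum_{m=1}^{K} \sum_{n=1}^{K} \alpha_t^m \pi^m_n p_{t+1}^n \beta_{t+1}^n\right)$
      end for
    end for
  end for
  return $\gamma$ and $\xi$
end function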
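The message passing in Algorithm 4.3.2 can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the code used for the experiments: it assumes the likelihoods $p_t^k$ have been precomputed, indexes states from 0, and rescales the messages at every step (a common numerical safeguard that the pseudocode does not show). All names are hypothetical.

```python
import math

def forward_backward(lik, pi0, trans):
    """Scaled forward-backward pass for a first-order hidden state model.

    lik[t][k]   : p(D_t | z_t = k), the state-conditional likelihoods
    pi0[k]      : initial state distribution
    trans[k][j] : p(z_{t+1} = j | z_t = k)
    Returns (gamma, xi, log_likelihood).
    """
    T, K = len(lik), len(pi0)

    # Forward messages, normalized at each step to avoid underflow.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [pi0[k] * lik[0][k] for k in range(K)]
        else:
            a = [lik[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in range(K))
                 for k in range(K)]
        c = sum(a)
        scale.append(c)
        alpha.append([v / c for v in a])

    # Backward messages, rescaled with the same constants.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[k][j] * lik[t + 1][j] * beta[t + 1][j]
                       for j in range(K)) / scale[t + 1]
                   for k in range(K)]

    # Posterior marginals and pairwise marginals.
    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    xi = [[[alpha[t][k] * trans[k][j] * lik[t + 1][j] * beta[t + 1][j] / scale[t + 1]
            for j in range(K)] for k in range(K)]
          for t in range(T - 1)]
    log_likelihood = sum(math.log(c) for c in scale)
    return gamma, xi, log_likelihood
```

With this scaling each `gamma[t]` already sums to one, and the data log-likelihood is recovered as the sum of the log scaling constants.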
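Given learned parameters and structures the hypothesis test takes the following form:

$$l_{1,2} = \log \frac{p\left(\mathcal{D} \mid a_{1:T}, \hat{F}, \hat{\Theta}\right)}{p\left(\mathcal{D} \mid b_{1:T}, \hat{F}, \hat{\Theta}\right)} \;\underset{H_2}{\overset{H_1}{\gtrless}}\; \log \frac{p\left(b_{1:T} \mid \hat{\Theta}\right)}{p\left(a_{1:T} \mid \hat{\Theta}\right)} \tag{4.19}$$

The test compares the log likelihood ratio to a threshold formed from comparing the
dynamics of the two hypothesized state sequences. Using the same form as Equation
3.36 one can show that in expectation under $H_1$, with $r = 0$, the likelihood ratio is

$$E\left[l_{1,2} \mid H_1\right] = \sum_{t \in d} D\left(p(\mathcal{D}_t \mid \hat{F}^{a_t}, \hat{\Theta}^{a_t}) \,\|\, p(\mathcal{D}_t \mid \hat{F}^{\cap_t}, \hat{\Theta}^{a_t})\right) + \sum_{t \in d} D\left(p(\mathcal{D}_t \mid \hat{F}^{\cap_t}, \hat{\Theta}^{a_t}) \,\|\, p(\mathcal{D}_t \mid \hat{F}^{b_t}, \hat{\Theta}^{b_t})\right) \tag{4.20}$$

and similarly when $H_2$ is true

$$E\left[l_{1,2} \mid H_2\right] = -\sum_{t \in d} D\left(p(\mathcal{D}_t \mid \hat{F}^{b_t}, \hat{\Theta}^{b_t}) \,\|\, p(\mathcal{D}_t \mid \hat{F}^{\cap_t}, \hat{\Theta}^{b_t})\right) - \sum_{t \in d} D\left(p(\mathcal{D}_t \mid \hat{F}^{\cap_t}, \hat{\Theta}^{b_t}) \,\|\, p(\mathcal{D}_t \mid \hat{F}^{a_t}, \hat{\Theta}^{a_t})\right) \tag{4.21}$$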


where $d = \{t \mid a_t \neq b_t\}$ is the set of all time points in which the hypothesized state
sequences  disagree.  See  Appendix  B.4  for  details.  Much  like  in  the  static  case,  we  see 
that  the  expected  log  likelihood  ratio  can  be  decomposed  into  two  terms.  The  first  term 
compares  structure  of  the  true  hypothesis  to  the  common  structure  under  a  consistent 
set  of  parameters.  The  second  term  exploits  both  structural  and  parametric  differences. 

However, unlike the situation in the static case, the parameters $\Theta^k$ are estimated
via  the  weighted  maximum  likelihood  estimate  in  Equation  4.9.  Each  parameter  is 
estimated  using  the  observed  data  differently  and  thus  parametric  differences  can  be 
exploited.  That  is,  the  second  set  of  terms  will  not  drop  out  as  they  did  in  the  static 
case  as  discussed  in  Section  3.3.2. 


■  4.3.2  Illustrative  Examples 

In  the  previous  section  we  discussed  how  ML  inference  using  an  HFactMM  can  exploit 
both  structural  and  parameter  differences  among  the  time-series  to  help  identify  changes 
in  their  dependence  relationships.  In  this  section,  we  present  a  set  of  simple  illustrative 
experiments  in  order  to  provide  some  more  intuition  about  these  models  and  their  uses. 
That  is,  we  present  experiments  to  help  answer  the  following  questions: 
1. How much can be gained by using an HFactMM over standard window approaches?
That is, how is the performance of both windowed analysis and ML
inference using an HFactMM affected as we change both the structural and
parametric differences between the dependence relationships found in the data?

2.  When  are  state  dynamics  important? 

3.  What  can  be  done  when  the  correct  model  parameterization  is  unknown? 

We begin with a simple experiment involving two (i.e. $N = 2$) 1-d time-series.
We create an HFactMM($r = 0$, $K = 2$) with the factorizations for each state set to be
independent, $F^1 = \{\{1\},\{2\}\}$, and dependent, $F^2 = \{\{1,2\}\}$. Since $r = 0$, this model
produces i.i.d. samples given the state. Gaussian factor models are chosen such that the
parameters for state $k = 1$, $\Theta^1_1, \Theta^1_2$, describe zero mean, unit variance distributions.
The parameters for state 2, $\Theta^2_{1,2}$, are set such that the joint mean is $[0\ \Delta]^T$, and the
covariance is full, with unit marginal variance and correlation $\rho$. The transition dynamic
parameters are set such that starting in either state is equally likely, $\pi^0_1 = \pi^0_2 = 0.5$,
and that there is 0.95 probability of self transition, $\pi^1_1 = \pi^2_2 = 0.95$,
$\pi^1_2 = \pi^2_1 = 0.05$. This yields a model with a simple state dynamic and a control on
structural and parametric differences via $\rho$ and $\Delta$ respectively.

For each setting of $\rho$ and $\Delta$, 200 samples (i.e. $T = 200$) of the joint process $z_{1:T}$
and $X_{1:T}$ are drawn. That is, $z_{1:T}$ is drawn using the transition distribution and then
$X_{1:T}$ is drawn to form observations $\mathcal{D}$ given this state sequence. Figure 4.2 shows one
realization. The top of the figure shows $\mathcal{D}$ colored by which state each sample came from
along with an indication of how the Gaussian conditional models are parameterized.
$\Delta$ controls the separation between each conditional FactM distribution and $\rho$ controls the
correlation in the FactM used for state 2. The bottom of the figure shows the sampled
state sequence as a sequence of colors representing which state is active at each point in
time.
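The sampling procedure just described can be sketched as follows. This is a minimal illustration, not the code used to produce Figure 4.2; the function name, the `seed` argument, and the use of Python's `random` module are assumptions made for the example.

```python
import math
import random

def sample_hfactmm(T, rho, delta, p_self=0.95, seed=0):
    """Draw (z, x) from the two-state model described above.

    State 1: x1 and x2 independent, zero mean, unit variance.
    State 2: jointly Gaussian, mean [0, delta], unit variances, correlation rho.
    """
    rng = random.Random(seed)
    z, x = [], []
    state = rng.choice([1, 2])              # either initial state is equally likely
    for _ in range(T):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if state == 1:
            obs = (e1, e2)
        else:
            # Mixing two independent normals yields unit variance and correlation rho.
            obs = (e1, delta + rho * e1 + math.sqrt(1.0 - rho ** 2) * e2)
        z.append(state)
        x.append(obs)
        if rng.random() >= p_self:          # switch with probability 1 - p_self
            state = 3 - state
    return z, x
```

For example, `z, x = sample_hfactmm(200, rho=0.8, delta=2.0)` draws one realization of length $T = 200$.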

We compare 3 different approaches for dynamic dependence analysis on this data.
The first is an ML windowed approach reasoning over two possible factorization models.
The second approach performs ML inference using an HFactMM(0,2) with the correct
structures given but with unknown parameters. The third performs ML inference using
a modified model which has no temporal dynamic on the state sequence. We will refer
to this model as a factorization mixture model, FactMM(0,2). Each approach outputs a
labeling for each time point indicating whether or not the observations are dependent.
This labeling is compared to the known $z_{1:T}$ to calculate performance.

Figure 4.2. A Sample Draw from an HFactMM(0,2): The top part of the figure shows the samples
colored by their state (black=state 1, red=state 2). The parametric difference between the two possible
dependence models is controlled via $\Delta$, and the structural difference is controlled by $\rho$. The bottom part
of the figure depicts $z_t$ using the same colors.

The ML windowed approach reduces to calculating the correlation between observed
time-series as shown in the example in Section 3.3.2. As was the case in that example,
the factorizations of interest here are nested. Thus we also estimate significance via
calculation of a p-value for each window. Window sizes of 5, 10, 20, and 40 samples
were each tested. We record results obtained in an unrealistic, best-case scenario for
the windowed analysis. That is, for each data set analyzed, we find the threshold on
the p-value that yields the best performance for each window size and then report
the best result over all window sizes.
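A windowed correlation test of this kind might look like the sketch below. Two points are assumptions rather than details taken from the text: the windows are taken to be non-overlapping, and the p-value is computed with the Fisher z-transform and a normal approximation, which is only a rough stand-in for whatever significance calculation was actually used and is crude for very short windows. The function name is hypothetical.

```python
import math

def window_correlation_pvalues(x, window):
    """Sample correlation and an approximate p-value per non-overlapping window.

    x: list of (x1, x2) pairs. Returns a list of (start, r, p_value) tuples.
    """
    results = []
    for start in range(0, len(x) - window + 1, window):
        chunk = x[start:start + window]
        m1 = sum(a for a, _ in chunk) / window
        m2 = sum(b for _, b in chunk) / window
        s11 = sum((a - m1) ** 2 for a, _ in chunk)
        s22 = sum((b - m2) ** 2 for _, b in chunk)
        s12 = sum((a - m1) * (b - m2) for a, b in chunk)
        r = s12 / math.sqrt(s11 * s22) if s11 > 0 and s22 > 0 else 0.0
        r = max(-0.999999, min(0.999999, r))        # keep atanh finite
        # Fisher z-transform: atanh(r) is roughly normal with variance 1/(n - 3).
        zscore = math.atanh(r) * math.sqrt(max(window - 3, 1))
        p_value = math.erfc(abs(zscore) / math.sqrt(2.0))   # two-sided
        results.append((start, r, p_value))
    return results
```

Labeling a window as dependent when its p-value falls below a threshold gives one decision per window, which is then assigned to every time point in that window.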

Results are shown in Figures 4.3 and 4.4. Each plot shows the average probability
of error over 100 trials for various settings of $\Delta$ and $\rho$. Consistent with the analysis
presented in Section 3.3.2, Figure 4.3 shows that the performance of the windowed
analysis is largely unaffected by changes in the non-structural parameter, $\Delta$. The slight
decrease in performance as $\Delta$ increases is due to the fact that larger $\Delta$ makes the
data look more dependent when a window overlaps a transition between dependent and
independent samples.

Figure 4.3. 2D Gaussian Experimental Results Using a Windowed Approach: Performance is
measured in terms of average % error over 100 trials. Results are shown as a function of $\rho$ for various
settings of $\Delta$.

Figure 4.4. 2D Gaussian Experimental Results Using an HFactMM and FactMM: The HFactMM
results are shown as solid lines while the FactMM results are shown as dashed lines. Performance is
measured in terms of average % error over 100 trials. Results are shown as a function of $\rho$ for various
settings of $\Delta$.

As predicted by our analysis in Section 4.3 and in contrast to the windowed test,
both the HFactMM and FactMM dramatically improve as $\Delta$ increases. In general, all
approaches improve in performance with increasing $\rho$, with more rapid improvements
for the HFactMM and FactMM for larger $\Delta$. Dynamics help most when $\Delta$ is small, i.e.
when the state conditional distributions overlap. For example, when $\Delta = 2$ the gap
in performance between the HFactMM and FactMM is particularly large and there is
great benefit to incorporating dynamics by using an HFactMM.
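To make the roles of $\rho$ and $\Delta$ concrete, the two terms of Equation 4.20 can be evaluated in closed form for this experiment. The following is a worked sketch under two simplifying assumptions that are not made in the text: the estimated parameters equal the true ones, and at a time point $t \in d$ hypothesis $H_1$ assigns the dependent state while $H_2$ assigns the independent state, so that the common factorization is the independent one. Here $\Sigma_\rho$ denotes the unit-variance covariance with correlation $\rho$ and $I$ the identity.

```latex
\begin{align*}
D\!\left(\mathcal{N}\!\left([0\ \Delta]^T, \Sigma_\rho\right) \,\middle\|\, \mathcal{N}\!\left([0\ \Delta]^T, I\right)\right)
  &= \tfrac{1}{2}\left[\operatorname{tr}(\Sigma_\rho) - 2 - \log\det\Sigma_\rho\right]
   = -\tfrac{1}{2}\log\left(1 - \rho^2\right), \\
D\!\left(\mathcal{N}\!\left([0\ \Delta]^T, I\right) \,\middle\|\, \mathcal{N}\!\left([0\ 0]^T, I\right)\right)
  &= \tfrac{1}{2}\,\Delta^2 .
\end{align*}
```

Under these assumptions the first (structural) term depends only on $\rho$ and vanishes as $\rho \to 0$, while the second term grows quadratically in $\Delta$, which is consistent with the improvement of the HFactMM and FactMM as $\Delta$ increases.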

In the previous experiment we used a simple Gaussian parameterization for our
state conditional distributions/FactMs. More complex parameterizations can be used.
However,  as  the  next  example  shows,  certain  ambiguities  may  arise.  Such  ambiguities 
can  be  overcome  when  an  underlying  dynamic  is  present.  Consider  the  data  shown 
in  Figure  4.5(a).  This  data  was  generated  with  an  HFactMM(0,2)  using  the  same 
dependent  and  independent  factorizations  as  in  the  previous  example.  However,  in  this 
case  the  factor  models  are  no  longer  single  Gaussians.  When  the  data  is  independent 
it  is  sampled  from  the  product  of  two  Gaussian  mixture  models,  each  of  which  has  two 
mixture  components  each  with  spherical  unit  covariance.  The  result  is  a  four  component 
mixture  model  in  the  joint  space  (indicated  by  black  circles  in  the  figure).  When  the 
data  is  dependent  it  comes  from  a  four  component  joint  mixture  (indicated  by  the  red 
circles). 

Figures  4.5(b)  and  4.5(c)  show  FactMM  and  HFactMM  models  learned  from  200 
samples  of  this  mixture  model  using  the  true  parameterization  (i.e.  correct  number  of 
mixtures)  but  unknown  parameters.  Using  mixture  models  for  each  state  conditional 
FactM  complicates  the  M-Step  in  EM  slightly.  The  M-step  itself  must  use  an  embedded 
EM  step  to  learn  the  mixture  model  for  each  state. 

Note that for this particular model there are many possible combinations of
independent and dependent mixtures. In fact, the FactMM model estimated/learned one
of these alternative mixtures in Figure 4.5(b). This is because, by assuming independent
samples and ignoring the state dynamic, all valid combinations of dependent and
independent mixtures are equally likely. By incorporating dynamics, the HFactMM
finds the correct solution. Over 100 trials the FactMM and HFactMM models had an
average performance of 64% and 99% accuracy, respectively, in labeling samples as
dependent or not.

A  lingering  question  is  what  to  do  when  the  “correct”  parameterization  is  not  known. 
One  approach  is  to  utilize  a  nonparametric  model  for  each  factor  in  the  state  conditional 
FactMs.  A  non-parametric  sample-based  kernel  density  estimate  (KDE)  can  be  used 
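for each factor $f$ of factorization $F^k$ such that:

$$p\left(X_t^f \mid \Theta^k\right) = \frac{1}{\sum_{j=1}^{T} \gamma_j^k} \sum_{j=1}^{T} \gamma_j^k \, K\!\left(X_t^f - \theta_j^f;\ \sigma^k\right) \tag{4.22}$$

$$\phantom{p\left(X_t^f \mid \Theta^k\right)} = \frac{1}{\sum_{j=1}^{T} \gamma_j^k} \sum_{j=1}^{T} \gamma_j^k \, K\!\left(X_t^f - X_j^f;\ \sigma^k\right) \tag{4.23}$$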

where $K(\cdot)$ is a valid kernel function with a kernel size $\sigma^k$ and $\gamma_t^k$ is defined in Equation
4.10. Note here, in addition to specifying $\sigma^k$, the parameters $\Theta$ are the observations in
$\mathcal{D}$, with $\theta_j = X_j$. Given the mixture data used in the previous example, Figure 4.5(e) shows
the learned HFactMM model using a KDE with a Gaussian kernel for its state conditional
distributions. The figure shows the difference in state conditional distributions
at each point in the observation space: $p(X_t \mid z_t = 2, F^2, \Theta^2) - p(X_t \mid z_t = 1, F^1, \Theta^1)$.

Leave-one-out likelihood was used to learn the kernel size $\sigma^k$. It is important to note
that one must be careful when using more powerful state-conditional distributions. If
a single state conditional distribution is flexible enough to describe all of the data and
the state transition probabilities are learned, an HFactMM can describe the data over
all time well using a single state. One way to deal with this issue is to modify learning
to favor non-degenerate state transition distribution parameters $\pi$ or to limit the
complexity of the state conditional models.
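A weighted kernel density estimate with a leave-one-out bandwidth search might be sketched as below. The example is one-dimensional, normalizes the weights so the estimate integrates to one, weights each leave-one-out term by its own weight, and searches over a fixed list of candidate bandwidths; these are all assumptions for illustration, and the function names are hypothetical. The weights stand in for the posteriors $\gamma_t^k$ of one state and are assumed to be strictly positive.

```python
import math

def gaussian_kernel(u, sigma):
    """One-dimensional Gaussian kernel with bandwidth sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def weighted_kde(x, data, weights, sigma, skip=None):
    """Weighted KDE evaluated at x; `skip` optionally leaves one sample out."""
    num = den = 0.0
    for j, (xj, wj) in enumerate(zip(data, weights)):
        if j == skip:
            continue
        num += wj * gaussian_kernel(x - xj, sigma)
        den += wj
    return num / den

def loo_log_likelihood(data, weights, sigma):
    """Weighted leave-one-out log-likelihood used to score a bandwidth."""
    # The tiny constant guards against log(0) when a kernel underflows.
    return sum(w * math.log(weighted_kde(x, data, weights, sigma, skip=t) + 1e-300)
               for t, (x, w) in enumerate(zip(data, weights)))

def select_bandwidth(data, weights, candidates):
    """Pick the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda s: loo_log_likelihood(data, weights, s))
```

Leaving each sample out of its own density estimate is what prevents the score from favoring an arbitrarily small kernel size.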

Figure 4.5. A More Complex 2D Example: a) True distribution, $F^0$ = thin black, $F^1$ = thick red. b)
Learned FactMM with correct parameterization. c) Learned HFactMM with correct parameterization. d)
Learned HFactMM with Discrete Code. e) Learned HFactMM with KDE. The mean accuracy over 50 trials
was FactMM=64%, HFactMM=99%, HFactMM w/ Discrete Code=98%, HFactMM w/ KDE=98%.

Figure 4.6. Quantization of Each Observed Time-Series $\mathcal{D}^v$: A codebook is learned for the input
data treating each sample $t$ independently. Using the codebook, $\mathcal{D}^v$ is quantized to form a new discrete
time-series. Note that each time-series is quantized separately.

An alternative approach to using nonparametric density estimators is to operate
on vector quantized versions of each time-series. A separate codebook, $C^v$, for each
observed time-series $\mathcal{D}^v$ is obtained via vector quantization. This can be done using
the k-means algorithm or by fitting a Gaussian mixture model (cf. [24]) which treats each
time point independently. These codebooks are then used to encode the data as discrete
time-series. Each time-series is quantized separately prior to dependence analysis.
Figure 4.6 depicts this process as a signal processing block diagram. The quantized data
is modeled with a dynamic dependence model with discrete state conditional distributions.
Creating a separate eight-symbol codebook for each time-series, and then using
the corresponding discrete quantized versions of each time-series for dynamic dependence
analysis using an HFactMM, we estimate/learn parameters which yield the state conditional
models shown in Figure 4.5(d). This technique produced an average accuracy of
98% in 100 trials. Note that while the model is a crude approximation to the true
distribution it captures enough information about dependence to make a correct decision.
We will take a similar approach in Chapter 5.
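The per-series quantization step can be sketched with a small k-means routine. The version below handles scalar time-series only and initializes the codebook from randomly chosen data points; both choices, along with the function names, are assumptions made for illustration.

```python
import random

def quantize_value(value, codebook):
    """Index of the nearest codeword."""
    return min(range(len(codebook)), key=lambda s: abs(value - codebook[s]))

def kmeans_codebook(series, num_symbols, iterations=20, seed=0):
    """Learn a scalar codebook with k-means, treating time points independently."""
    rng = random.Random(seed)
    codebook = rng.sample(series, num_symbols)   # initialize from the data
    for _ in range(iterations):
        sums = [0.0] * num_symbols
        counts = [0] * num_symbols
        for value in series:
            s = quantize_value(value, codebook)
            sums[s] += value
            counts[s] += 1
        # Cells that received no samples keep their previous codeword.
        codebook = [sums[s] / counts[s] if counts[s] else codebook[s]
                    for s in range(num_symbols)]
    return codebook

def quantize_series(series, codebook):
    """Encode a real-valued series as a sequence of discrete symbols."""
    return [quantize_value(v, codebook) for v in series]
```

Each observed series would get its own call, for example `quantize_series(s, kmeans_codebook(s, 8))` for an eight-symbol codebook, before the discrete symbols are passed to the dependence analysis.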


■  4.4  Bayesian  Inference 

In  the  previous  section  we  examined  how  ML  inference  can  be  used  to  obtain  point 
estimates  of  K  structures  and  a  state  sequence  indexing  which  structure  is  active  at  each 
point  in  time.  In  this  section  we  return  to  the  problem  of  obtaining  a  full  posterior  on 
structure  and  state  sequence  given  the  observed  time-series  rather  than  point  estimates. 
Again, this is a difficult task due to the number of possible state sequences, $K^T$, and
requires one to define priors over the unobserved parameters and structure.

Our  primary  motivation  for  developing  a  method  for  Bayesian  inference  is  that 
obtaining  the  posterior  on  dependence  structure  is  desirable  over  point  estimates  in 
applications  such  as  moving  object  interaction  analysis.  That  is,  in  Chapter  6  we 
make  no  assumptions  as  to  the  existence  of  a  “true/correct”  sequence  of  dependence 
structure  and  focus  on  characterizing  posterior  uncertainty  in  the  relationships  among 
multiple  moving  objects.  We  use  a  STIM  for  this  task  and  specialize  our  discussion 
and notation in this section to such models. That is, we adopt a dynamic dependence
model which uses TIM state conditional distributions specified by directed structures
$E^k$ and parameters $\Theta^k$.

In the next section we define a tractable prior over the structure and parameters of a
STIM. We then consider a Markov chain Monte Carlo (MCMC) approach for obtaining
samples from the joint posterior rather than an exact/full characterization (cf. [1, 36]).
We show that, while the joint samples of $z_{1:T}$ and $E$ drawn using our MCMC approach
are used as a proxy to the true posterior, given a specific set of labeled time points in
which $z_t = k$ one can always obtain exact posteriors over structure $E^k$ as discussed in
Section 3.4.3.

Figure 4.7. A STIM(0,$K$) shown as a directed Bayesian network: The latent state $z_t$ indexes the
directed structure $E^k$ and parameters $\Theta^k$ of a TIM describing the evolution of all time-series $X_t$.
Hyperparameters $\alpha$ specify a prior on the transition distribution described by $\pi$. Hyperparameters $\beta$
and $\Gamma$ are used to define priors on directed structure and parameters respectively.

■  4.4.1  A  Prior  for  STIMs 

The generative model for a STIM($r$,$K$) was introduced in Section 4.2. Here, we discuss
placing priors on the parameters of this model. We assume the prior on all parameters
and structures factorizes as:

$$p_0(E, \Theta) = p_0\left(\pi^0, \ldots, \pi^K\right) \prod_{k=1}^{K} p_0\left(E^k, \Theta^k\right). \tag{4.24}$$
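Each state transition parameter set $\pi^k$ is treated independently and a Dirichlet prior
distribution is used (which is conjugate for a multinomial):

$$p_0\left(\pi^0, \ldots, \pi^K\right) = \prod_{k=0}^{K} p_0\left(\pi^k \mid \alpha^k\right) \tag{4.25}$$

$$\phantom{p_0\left(\pi^0, \ldots, \pi^K\right)} = \prod_{k=0}^{K} \mathrm{Dir}\left(\pi^k; \alpha^k_1, \ldots, \alpha^k_K\right). \tag{4.26}$$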

Each hyperparameter $\alpha^k_j$ can be interpreted as a pseudo-count of how many times one
has seen a transition from state $k$ to state $j$. State persistence (approximating a
semi-Markov process) can be favored by setting

$$\alpha^k_j = c \quad \forall\, j \neq k \quad \text{and} \quad \alpha^k_k = bc, \tag{4.27}$$

where $c$ is a constant number of pseudo-counts and $b > 1$ is a multiplier for $c$ to favor
self transitions.

The  prior  for  the  structure  and  parameters  of  the  state  conditional  TIMs  takes  the 
form  described  in  Section  3.4.2: 

$$p_0\left(E^k, \Theta^k\right) = p_0\left(E^k \mid \beta\right) p_0\left(\Theta^k \mid E^k, \Gamma\right) \tag{4.28}$$

An alternative model could place state specific priors with each state having its own $\beta^k$
and $\Gamma^k$. Figure 4.7 depicts a STIM(0,$K$) along with its priors as a dynamic Bayesian
network.

■  4.4.2  MCMC  Sampling  for  a  STIM 

Given observed time-series in $\mathcal{D}$ and a prior over parameters and structure of a STIM,
we wish to characterize the posterior over structure at each point in time. Again, this
is a difficult task due to the large space of $K^T$ possible state sequences. Here, we take
an MCMC approach. Iterative MCMC methods are useful for drawing samples from
an otherwise intractable target distribution. These methods construct a Markov chain
that has the target distribution of interest as its equilibrium distribution. That is, after
a large number of iterative steps a sample from this chain is a valid sample from the
target distribution. The question of how many iterations are sufficient is a difficult
one to answer and guarantees on obtaining valid samples are only asymptotic [1, 36].
In practice we tend to initialize multiple samplers and run them for a fixed period of
time, examining their output in the context of the problem of interest to determine if
sufficient time was given.

Here, we design a Gibbs sampler for drawing samples from the posterior using a
STIM. A Gibbs sampler is a specific type of MCMC method that is well suited to our
hidden state dynamic dependence model in that it uses exact conditional distributions
that are tractable to compute using a STIM [37]. Our Gibbs sampler has three main
steps which are outlined in Algorithm 4.4.1.

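Algorithm 4.4.1 The Three Main Steps of a STIM Gibbs Sampler.
Require: previous sample of the set of structures $E^{(i-1)}$ and parameters $\Theta^{(i-1)}$
  % The three main steps
  Step 1. $z_{1:T}^{(i)} \sim p\left(z_{1:T} \mid \mathcal{D}, E^{(i-1)}, \Theta^{(i-1)}\right)$
  Step 2. $\{\pi^{0(i)}, \ldots, \pi^{K(i)}\} \sim p\left(\pi \mid z_{1:T}^{(i)}, \alpha\right)$
  Step 3. $\{E^{(i)}, \Theta^{(i)}\} \sim p\left(E, \Theta \mid \mathcal{D}, z_{1:T}^{(i)}\right)$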


The  first  step  samples  the  state  sequence  given  a  previous  sample  of  structures 
and  parameters.  This  is  done  efficiently  with  backward  message  passing  followed  by 
forward  sampling.  High  level  details  of  Step  1  are  shown  in  Algorithm  4.4.2.  A  simple 
modification  of  Step  1  is  used  to  initialize  the  sampler  by  sampling  a  state  sequence 
using  only  the  transition  prior  rather  than  incorporating  information  from  the  data. 

In Step 2, the sampled counts of state transitions are recorded. Let $n^k_j$ for $k \in
\{0, \ldots, K\}$ and $j \in \{1, \ldots, K\}$ be a count of the number of times $z_{t-1}^{(i)} = k$ and $z_t^{(i)} = j$.
Given these quantities we then sample the transition probabilities using Algorithm 4.4.3.
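Because the Dirichlet prior is conjugate, Step 2 amounts to adding the observed transition counts to the pseudo-counts and drawing each row from the resulting Dirichlet. The sketch below is illustrative only: it indexes states from 0, omits the initial state distribution $\pi^0$ for brevity, and draws Dirichlet samples by normalizing Gamma variates. The function names and default values of `c` and `b` are assumptions.

```python
import random

def sticky_pseudo_counts(K, c=1.0, b=10.0):
    """alpha[k][j] = c for j != k and b * c for j == k, as in Equation 4.27."""
    return [[b * c if j == k else c for j in range(K)] for k in range(K)]

def transition_counts(z, K):
    """counts[k][j] = number of times state k is immediately followed by state j."""
    counts = [[0] * K for _ in range(K)]
    for prev, nxt in zip(z[:-1], z[1:]):
        counts[prev][nxt] += 1
    return counts

def sample_dirichlet(params, rng):
    """Draw from Dir(params) by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transitions(z, K, alpha, rng):
    """Sample each row pi^k from its Dirichlet posterior given the counts."""
    counts = transition_counts(z, K)
    return [sample_dirichlet([alpha[k][j] + counts[k][j] for j in range(K)], rng)
            for k in range(K)]
```

A larger multiplier `b` concentrates the prior mass on self transitions, so short state sequences with few observed transitions still yield persistent sampled dynamics.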

In Step 3, a vector $t^k$ is formed with all the time points with $z_t = k$. The structure
and parameters are then sampled given $\mathcal{D}_{t^k}$ for each $k$. See Algorithm 4.4.4. Step 3
requires one to sample from the posterior over structure and parameters. Details can
be found in Appendix ??. It is important to note that when using a switching temporal
interaction model, one can efficiently calculate exact event probabilities and the posterior
over structures given a specific state sequence as described in Section 3.4.3. That is,
one can perform static dependence analysis on the data $\mathcal{D}_{t^k}$ for each $k$.

Algorithm 4.4.4 exposes how information about each parameter $\Theta^k$ and structure
$E^k$ is obtained from distinct parts of the data $\mathcal{D}_{t^k}$. Thus, there is more information
to take advantage of when determining the structure/state at each point in time, in
contrast to a windowed approach. We refer the reader to Chapter 6 for illustrative
experiments and specific examples using this Gibbs sampler.

Algorithm 4.4.2 Step 1 of the STIM Gibbs Sampler During Iteration $i$
Require: The previous sampled structures $E^{(i-1)}$ and parameters $\Theta^{(i-1)}$
  % Generate backward messages $m_t^k = p\left(\mathcal{D}_{(t+1):T} \mid z_t = k, \tilde{\mathcal{D}}_{t+1}, E^{(i-1)}, \Theta^{(i-1)}\right)$
  for $k = 1$ to $K$ do
    $m_T^k \leftarrow 1$  % Initialize Messages
  end for
  for $t = T - 1$ down to 1 do
    for $k = 1$ to $K$ do
      $m_t^k \leftarrow \sum_{j=1}^{K} \pi^{k(i-1)}_j \, p\left(\mathcal{D}_{t+1} \mid \tilde{\mathcal{D}}_{t+1}, E^{j(i-1)}, \Theta^{j(i-1)}\right) m_{t+1}^j$
    end for
  end for
  % Sample state sequence $z_{1:T}^{(i)}$ working sequentially forward in time
  % (for $t = 1$ the previous state is taken to be 0, so that $\pi^0$ is used)
  for $t = 1$ to $T$ do
    for $k = 1$ to $K$ do
      % Using probability $f_k \propto p\left(z_t = k \mid z_{1:t-1}^{(i)}, \mathcal{D}\right)$
      $f_k \leftarrow \pi^{z_{t-1}^{(i)}(i-1)}_k \, p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, E^{k(i-1)}, \Theta^{k(i-1)}\right) m_t^k$
    end for
    % Sample the state assignment for $z_t^{(i)}$
    $z_t^{(i)} \sim$ Discrete$(z_t; f)$
  end for

Algorithm 4.4.3 Step 2 of the STIM Gibbs Sampler During Iteration $i$.
Require: $n^k_j$, for each $k \in \{0, \ldots, K\}$ and $j \in \{1, \ldots, K\}$, based on $z_{t-1}^{(i)} = k$ and $z_t^{(i)} = j$
  % Sample $\pi^k$
  $\pi^{k(i)} \sim \mathrm{Dir}\left(\pi^k; \alpha^k_1 + n^k_1, \ldots, \alpha^k_K + n^k_K\right)$

Algorithm 4.4.4 Step 3 of the STIM Gibbs Sampler During Iteration $i$
Require: Sampled $z_{1:T}^{(i)}$; for each $k \in \{1, \ldots, K\}$
  Let $t^k = \{t \mid z_t^{(i)} = k\}$
  % Calculate the posterior given data at points $t^k$ and sample
  $\{E^{k(i)}, \Theta^{k(i)}\} \sim p\left(E^k, \Theta^k \mid \mathcal{D}_{t^k}, \tilde{\mathcal{D}}_{t^k}, \beta, \Gamma\right)$
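The backward message passing and forward sampling of Step 1 (Algorithm 4.4.2) can be sketched as follows. This is an illustrative version, not the implementation used in Chapter 6: it assumes the state-conditional likelihoods have been precomputed, indexes states from 0, and rescales the backward messages at each step to avoid underflow, which leaves the sampling distribution unchanged. The names are hypothetical.

```python
import random

def sample_state_sequence(lik, pi0, trans, rng):
    """Draw a state sequence from p(z_{1:T} | data, parameters).

    lik[t][k]   : p(D_t | z_t = k) under the current structures and parameters
    pi0[k]      : initial state distribution
    trans[k][j] : p(z_{t+1} = j | z_t = k)
    """
    T, K = len(lik), len(pi0)

    # Backward messages m[t][k], proportional to p(D_{t+1:T} | z_t = k).
    m = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [sum(trans[k][j] * lik[t + 1][j] * m[t + 1][j] for j in range(K))
               for k in range(K)]
        norm = sum(raw)
        m[t] = [v / norm for v in raw]   # rescale; constants cancel when sampling

    # Forward sampling: f_k is proportional to p(z_t = k | z_{t-1}) p(D_t | k) m[t][k].
    z = []
    for t in range(T):
        prior = pi0 if t == 0 else trans[z[-1]]
        f = [prior[k] * lik[t][k] * m[t][k] for k in range(K)]
        z.append(rng.choices(range(K), weights=f)[0])
    return z
```

Because the messages summarize all future observations, each draw of $z_t$ is an exact sample from its conditional, and the resulting sequence is a joint sample rather than a sequence of independent per-time-point draws.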

■  4.5  Related  Work 

As  discussed  in  the  sections  above,  our  DDM  is  closely  related  to  a  standard  HMM.  A 
DDM  differs  in  that  the  latent  state  indexes  not  only  parameters,  but  structures  as  well. 
Specifically,  the  dependence  structure  among  observations  is  a  function  of  the  value  of 
the  hidden  state.  A  standard  graphical  model  cannot  capture  the  relationship  between 
the value of a random variable and dependence structure. We hid this relationship
in Figure 4.1 by representing all time series as a single random variable, $X_t$. If one
wishes to expose the details of how structure changes as a function of the value of nodes
in  a  graphical  representation,  a  new  set  of  semantics  beyond  that  of  standard  graphical 
models  is  required. 

The study of models that can encode structure as a function of the value of a random
variable/node can be traced back to Heckerman and Geiger's similarity networks
and Bayesian multinets [44, 35]. These models can represent asymmetric independence
assertions, i.e. variables are independent for some but not all of their values. Similarity
networks are represented as a collection of Bayesian networks, each of which holds true
for a particular setting of a hypothesis random variable. These networks were used
primarily in expert diagnosis systems for various diseases. The representation allows
for an expert on a particular disease to design a Bayesian network that only includes
the variables/symptoms that are relevant for diagnosing that disease. They did not
discuss inference in the context of these similarity networks. Their focus was primarily on
providing a way to convert a similarity network to a full Bayesian network. The
asymmetric independence assumptions were then encoded in the parameters of conditional
probability distributions rather than in the topology of the network. Thus, a standard
inference algorithm would not have access to the added independence information and
could not take explicit advantage of it to reduce computation.

Bayesian multinets were introduced in an effort to enhance similarity networks and
provide more efficient inference techniques. Multinets, much like similarity networks,
are represented by multiple Bayesian networks each centered around a single
hypothesis variable/node. Each individual network holds true for a particular subset of
hypotheses. However, unlike similarity networks, each network in a multinet contains
all variables. Geiger [35] describes a general algorithm for efficient inference on
multinets and showed how the encoded asymmetric independence assumptions can reduce
storage requirements and reduce computation. He also showed how this algorithm can
be used for general inference on similarity networks by providing a way to convert a
similarity network into a Bayesian multinet. Seeking even more flexible models, Geiger
and Heckerman also introduced the concept of a generalized similarity network in which
there may be multiple hypothesis nodes with relationships between them.

This class of models was further explored and formalized in Boutilier et al.'s work on
Context-Specific Independence (CSI) [11]. CSI is a general notion that encompasses
the idea of asymmetric independence discussed by Heckerman and Geiger. CSI
describes situations in which two variables, X and Y, are independent given a certain
context c, i.e. a particular setting of values to certain random variables. Boutilier et
al. showed how to verify global CSI statements given a set of local CSI statements
in a network. Their work also showed how CSI yields regularities in the conditional
probability tables (CPTs) for each variable in a Bayesian network. Unlike Geiger and
Heckerman, who used multiple networks to encode additional independence
assumptions, [11] used a structured representation for their conditional distributions to encode
CSI. They focused on the use of decision-tree structured CPTs and showed how this
particular representation for CSI is compact and can be
used to support effective inference algorithms. In a companion paper Friedman and
Goldszmidt [32] showed how this structured representation of CPTs can aid in learning
Bayesian networks from data.

More recently Milch et al. have discussed further generalizations of Bayesian
networks that focus on making CSI explicit [64]. They introduced the concept of
partition-based models (PBMs) in which the conditional probability distribution of each variable
is determined by a particular partitioning of the outcome space rather than on a set
of parents. They provide a specific implementation of a PBM called a contingent
Bayesian network (CBN). A CBN combines the use of structured CPTs (as in [11])
with labeling edges in a Bayesian network with contexts/conditions in which the edges
are active/present. This edge labeling provides a simple representation in which CSI
relationships can be more easily read from a graphical depiction of the model. A CBN
can also contain cycles and have an unbounded number of latent variables. Milch et
al. discuss the conditions in which a CBN defines a unique distribution and provide an
algorithm for approximate inference.

In [7], Bilmes introduced a model for time-series that makes use of CSI. His dynamic
Bayesian multinet is a hidden state model in which the value of the hidden state at
time t encodes the dependence structure among the observations within a local time
window surrounding t. Bilmes’ work focused on techniques for learning the state-conditional
dependence structure from labeled training data to help improve classification.
These techniques use a greedy pairwise criterion for adding new edges to encode causal
relationships in a network. He focused on the problem domain of speech recognition
and demonstrated how learning class-specific (state-specific) discriminative structure
can help improve performance.

Our DDM belongs to this general class of models, as do the related models of [7,
52, 94]. It is most closely related to Bilmes’ dynamic Bayesian multinet in that when
using a TIM we are reasoning over directed causal structures. However, our inference
task is that of unsupervised discovery of structure rather than learning a better model
for predictive analysis given training data. In addition, by using the static dependence
models and inference tools discussed in Chapter 3 we are able to provide exact Bayesian
reasoning over structure. That is, we do not rely on greedy search or other approximate
methods for structural discovery.

■  4.6  Summary 

In this chapter, the static dependence models discussed in Chapter 3 were extended
to allow structure to change over time. This was done by introducing the class of
DDMs. A DDM uses a dynamically evolving latent variable to index both the structure
and parameters of an underlying static dependence model over time. We discussed how
dynamic dependence analysis can be performed via inference on a DDM and contrasted
this approach with standard windowed analysis. In contrast to ML inference on a static
dependence model, we showed how ML inference on a DDM can take advantage of
parametric differences when distinguishing structure. This analysis was supported by
a set of illustrative experiments. By placing priors on the transition distribution of the
DDM along with those presented in Chapter 3 for a TIM, we demonstrated how one
can use a Gibbs sampler for Bayesian inference. While this approach only produces
samples from the posterior, given a sampled state sequence exact marginal posterior
statistics can be calculated.

Application: Audio-Visual Association


As  technology  rapidly  advances,  electronics  used  for  digital  audio-visual  recording  and 
storage have become less expensive and their presence in our environment more ubiquitous.
In addition to movies, television shows and news broadcasts, inexpensive recording
devices  and  storage  have  allowed  for  business  meetings,  judicial  proceedings  and  other 
less  formal  group  meetings  to  be  digitally  archived.  At  the  same  time,  technologies  such 
as  automatic  speech  recognition  [60]  and  face  detection  and  recognition  [91,  89,  95]  have 
allowed  for  useful  metadata  to  be  extracted  from  these  archives.  Searches  can  then  be 
performed  to  quickly  find  videos  which  contain  certain  phrases  or  involve  particular 
individuals.  In  this  chapter,  we  explore  the  problem  of  determining  which,  if  any,  of  the 
individuals  in  the  video  are  speaking  at  each  point  in  time.  Whereas  one  could  make 
use  of  strong  prior  models  of  audio-visual  appearance  when  they  are  available,  here  we 
show  that  the  problem  can  be  cast  as  one  of  audio-visual  association,  demonstrating  the 
utility  of  the  previous  development  and  bypassing  the  need  for  strong  prior  assumptions. 
Specifically,  we  frame  the  problem  in  terms  of  dynamic  dependence  analysis. 

Hershey and Movellan previously showed how measuring correlation between
audio and video pixels can help detect who is speaking in a scene [46]. Nock and
Iyengar [68] provided an empirical study of this technique on a standardized dataset,
CUAVE [72]. Further study of detecting and characterizing the dependence between
audio and video was carried out by Slaney and Covell [84] and Fisher et al. [30]. All of
these  techniques  process  data  using  a  sliding  window  over  time  assuming  a  single  audio 
source  within  the  analysis  window.  As  such,  they  do  not  incorporate  data  outside  of 
the  analysis  window.  In  this  chapter,  we  show  empirically  that  by  treating  the  problem 
as  a  dynamic  dependence  analysis  task  in  which  data  over  all  time  is  considered,  one
can  exploit  audio  and  visual  appearance  changes  associated  with  who  is  speaking.  In
doing  so,  we  achieve  the  best  performance  to  date  on  a  standard  dataset  for  audio-visual 
speaker  association.  This  performance  was  achieved  without  any  window  parameters 
to  set,  without  a  silence  detector  or  lip  tracker  and  without  any  labeled  training  data. 

We  begin  by  introducing  the  three  datasets  used  in  our  experiments.  We  then  briefly 
discuss  the  basic  audio-visual  features  extracted  for  input  to  our  dynamic  dependence 
analysis  in  Section  5.2.  In  Section  5.3  we  map  reasoning  over  audio-visual  association 
to  reasoning  over  the  dependence  structures  of  a  static  dependence  model.  Lastly,  we 
present  results  in  Section  5.4. 

■  5.1  Datasets 

For our audio-visual speaker association experiments we use three separate datasets.
Each involves recordings of a pair of individuals taking turns speaking. All datasets
contain video recorded at 29.97 frames per second. We convert each video frame to a
256-level grayscale image. The audio for each dataset was resampled at 16 kHz and
broken up into segments aligned with the video frames.
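
The alignment of audio segments with video frames can be sketched as follows. At 16 kHz and 29.97 frames per second each frame spans roughly 533.87 audio samples, which is not an integer. The text does not say how this is handled, so rounding the segment boundaries to the nearest sample is an assumption of this sketch, as is the function name.

```python
def frame_boundaries(num_frames, sample_rate=16000, fps=29.97):
    """Return (start, end) audio sample indices for each video frame.

    Boundaries are rounded to the nearest sample (an assumption, not taken
    from the text), so consecutive segments are contiguous and differ in
    length by at most one sample.
    """
    samples_per_frame = sample_rate / fps  # about 533.87 samples
    bounds = []
    for k in range(num_frames):
        start = int(round(k * samples_per_frame))
        end = int(round((k + 1) * samples_per_frame))
        bounds.append((start, end))
    return bounds


bounds = frame_boundaries(3)
```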

■  5.1.1  CUAVE 

The  CUAVE  dataset  [72]  is  a  multiple  speaker,  audio-visual  corpus  of  spoken  connected 
digits.  We  use  the  22  clips  from  the  groups  set  in  which  two  speakers  take  turns  reading 
digit  strings  and  then  proceed  to  utter  two  different  strings  simultaneously.  In  order  to 
compare  to  prior  work  [68,  40],  we  only  consider  the  section  of  alternating  speech.  In 
each  clip  both  individuals  face  the  camera  at  all  times.  Ground  truth  was  provided  by 
Besson  and  Monaci  [6]. 

A separate video stream is extracted for each speaker in the CUAVE corpus. This
is done using a face detector [91] and simple correlation tracking of the nose region
in order to obtain a stabilized face. Figure 5.1 shows a sample frame from the raw
video in addition to extracted faces for each individual in the dataset. As seen in
Figure 5.1, the CUAVE dataset has fairly high-resolution frames with an uncluttered
background. As such, empirical results on it are useful for relative comparisons rather
than as measures of absolute performance.

Figure 5.1. The CUAVE Dataset: (a) A sample video frame from one sequence. (b) Individual frames
from the automatically extracted video streams for each face for all 22 sequences. Faces were extracted
using a face detector and simple correlation tracking to stabilize the region around a person’s nose.


■  5.1.2  Look  Who’s  Talking 

While the CUAVE corpus provides a standardized dataset for comparing performance
with prior work, the individuals recorded do not interact with each other and are always
facing the camera. Our second dataset is a single video recorded in the same style as
the CUAVE database in which two individuals take turns speaking digits. However,
while the speaker looks into the camera, the non-speaking individual turns to look at the
speaker. This provides a dataset in which there is a strong appearance change associated
with who is speaking, as may be the case in a meeting where participants look toward the
current speaker.

Figure 5.2. “Look Who’s Talking” Data: (a) A sample video frame. (b) The extracted hand-labeled
face regions.

A separate video stream is extracted for each individual. In this video the individuals
change their head pose throughout. One could develop an extension to the
simple face tracker used for the CUAVE dataset and attempt to extract a representation
for each person’s face which is stabilized and normalized for head pose. However, such
an approach would throw out potentially informative features. That is, in this video
head pose is an excellent cue as to who is speaking. Ideally, one would extract the
head pose information and use it as an additional observed feature. Here, we take
the simple approach of hand labeling a rectangular region in the video for each person
and cropping the video accordingly. That is, we sacrifice perfectly tracked faces for a simple
representation that can still capture strong appearance changes that occur with a change
of speaker. Figure 5.2 shows a sample frame from the raw video and extracted face
regions.

■  5.1.3  NIST  Data 

Lastly, we move away from scripted speech scenarios and explore more realistic data
involving meeting conversations. The third dataset is a single-camera recording of
a sequence from the NIST meeting room pilot corpus [34] (sequence 20011115-1050,
camera 1). This data was provided as part of the Advanced Research and Development
Activity (ARDA) Video Analysis and Content Exploitation (VACE) program.

Figure 5.3. NIST Data: (a) A sample video frame. (b) The extracted face regions using face detection
information provided to the VACE program.

This  sequence  is  a  recording  of  four  individuals  taking  turns  speaking.  Face  location 
information  was  provided  by  participants  in  the  VACE  program.  We  took  the  supplied 
face  track  information  and  extracted  a  video  sequence  for  two  individuals  facing  the 
camera in the sequence being analyzed. Figure 5.3 shows a sample frame and the
extracted  faces.  Note  that  in  this  dataset  the  extracted  face  regions  have  poor  resolution 
relative  to  the  previous  datasets. 

■  5.2  Features 

While  there  has  been  much  work  in  exploring  and  learning  informative  features  for 
audio-visual  association  tasks  [68,  40,  46,  84,  30],  the  focus  of  this  dissertation  is  on 
performing  dynamic  dependence  analysis  given  a  pre-specified  set  of  features.  To  this 
end, we choose video and audio features that capture the basic dynamics of each modality
as well as their static appearance.

We use frame-based features for the video streams extracted for each individual in
the scene as well as for the audio. That is, a set of features is extracted for each frame of
video. There are two basic types of features for each time-series. Static features for
frame t are calculated based on the data within that frame. Dynamic features at frame
t incorporate information from neighboring frames.

■  5.2.1  Video  Features 

For each sequence there are two video time-series, one for each individual in the scene.
Given the raw grayscale video sequence for each individual we extract a simple set
of static appearance features for each frame. These static features are calculated
by first performing principal component analysis (PCA) [74] (cf. [50]) on the entire
sequence and extracting the top 40 principal components. On average, using 40 components
captured over 90% of the energy for each sequence. The video sequence is
then transformed by projecting each frame onto this 40-component basis, yielding 40
coefficient values for each frame of video.

We calculate dynamic features by first taking raw pixel-wise frame differences. At
frame t, a difference image is formed from the raw frames at times t + 1 and t − 1.
This yields a difference video for each sequence. As with the static features, PCA is
performed on this video sequence and 40 coefficients for each frame are then extracted.
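
A minimal sketch of the (t + 1, t − 1) differencing step is given below, with each frame represented as a flat list of pixel values. The text does not state how the first and last frames are treated; this sketch simply drops them, which is an assumption, and the function name is hypothetical.

```python
def central_difference(frames):
    """Return frames[t + 1] - frames[t - 1] for every interior frame t.

    frames: list of equal-length lists of pixel values.
    The first and last frames have no two-sided neighbors and are skipped
    here (an assumption; the source does not specify boundary handling).
    """
    diffs = []
    for t in range(1, len(frames) - 1):
        diffs.append([a - b for a, b in zip(frames[t + 1], frames[t - 1])])
    return diffs


# Four tiny two-pixel "frames" for illustration.
frames = [[0, 10], [2, 10], [6, 11], [7, 15]]
diffs = central_difference(frames)
```

The same operation applies unchanged to the 13-dimensional MFCC vectors used for the dynamic audio features.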

Each of these feature streams is then vector-quantized. A 20-symbol codebook
is learned separately for the dynamic and static features by learning a 20-component
Gaussian mixture model using EM. Each stream is then encoded using its corresponding
codebook, yielding a discrete observation for each stream at each frame.¹ Figure 5.4(a)
summarizes how the video features are obtained via a block diagram.
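
The encoding step can be sketched as follows, assuming the mixture parameters have already been learned by EM (the learning step is omitted). Each sample is assigned the index of its most probable mixture component. The diagonal covariance structure, the two-component toy codebook and all parameter values are assumptions made for illustration; the text uses 20 components and does not specify the covariance form.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def quantize(samples, weights, means, variances):
    """Encode each sample with the index of its most probable mixture component."""
    codes = []
    for x in samples:
        scores = [math.log(w) + log_gaussian_diag(x, m, v)
                  for w, m, v in zip(weights, means, variances)]
        codes.append(max(range(len(scores)), key=scores.__getitem__))
    return codes


# Hypothetical 2-component codebook over 2-D features.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
codes = quantize([[0.2, -0.1], [4.8, 5.3], [1.0, 0.5]], weights, means, variances)
```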

■  5.2.2  Audio  Features 

For each sequence analyzed there is a single audio stream broken up into segments
corresponding to video frames. We calculate 13 Mel-frequency cepstral coefficients
(MFCCs) for each of these frames [63, 13] to form what we will refer to as our static
audio features. MFCCs are a common perceptually motivated feature used in automatic
speech recognition. Dynamic features at frame t are, again, formed by taking the
difference between the raw static MFCC features at frames t + 1 and t − 1. These audio
feature streams are separately vector-quantized using learned 20-symbol codebooks.
Figure 5.4(b) summarizes how the audio features are obtained via a block diagram.

¹ We arbitrarily chose 20 symbols at first. Some quick analysis of smaller and larger codebooks
showed little change in performance. However, more thorough analysis must be done before drawing
any conclusions about codebook size. Here, we simply provide a fixed feature extraction technique
for all datasets.

Figure 5.4. Video and Audio Feature Extraction Block Diagrams: Sub-blocks are explained in Figure
5.5. (a) Video processing: a block diagram summarizing how both the static $V_i^S$ and dynamic $V_i^D$
video features for person i are obtained. (b) Audio processing: a block diagram summarizing how the
static $A^S$ and dynamic $A^D$ audio features are obtained.


■  5.2.3  Inputs  to  Dynamic  Dependence  Analysis 

For each sequence analyzed, this yields three quantized static feature time-series: one
representing the audio, $x^{A^S}_{1:T}$, and two representing the video for each person in the
scene, $x^{V_1^S}_{1:T}$ and $x^{V_2^S}_{1:T}$. In addition, there are three corresponding dynamic feature
streams, $x^{A^D}_{1:T}$, $x^{V_1^D}_{1:T}$ and $x^{V_2^D}_{1:T}$. These are the N = 6 time-series input for analysis. For
clarity, we will index them as $A^S$, $A^D$, $V_1^S$, $V_1^D$, $V_2^S$ and $V_2^D$, rather than 1 through 6,
where the superscript denotes (S)tatic or (D)ynamic features and the subscript indexes
the speaker.

It  is  important  to  note  that  the  dimensionality  reduction  with  PCA  and  codebook 
learning  is  done  separately  for  each  stream  and  for  each  data  sequence  analyzed.  That 
is, no user- or dataset-specific training is being performed.

Figure 5.5. Sub-blocks for Audio and Video Processing: (a) A diagram showing how the (t + 1, t − 1)
sub-block works. It outputs the difference between a positive and negative temporal delay of its input.
(b) A diagram showing how dimensionality reduction is performed. Treating each sample independently,
PCA is performed on all of the input data to learn a basis $\phi$ with a specified number of components. All
the data is then projected onto this basis. (c) A diagram showing how vector quantization is performed.
All the input data is used to learn a codebook with a specified number of symbols. We form this
codebook by learning a Gaussian mixture model. The data is then quantized using this codebook,
encoding each sample with the index of the mixture component that best describes the sample.


■  5.3  Association  via  Factorizations 

Our  goal  is  to  determine  which,  if  any,  of  the  people  in  a  given  sequence  are  producing 
the  recorded  audio  at  each  point  in  time.  We  map  this  problem  to  that  of  determining 
which  video  observation  is  associated  with  the  audio  observation.  We  describe  this 
association  in  terms  of  a  factorization  on  the  input  time-series.  For  this  task  we  define 
a  finite  set  of  factorizations  we  wish  to  identify. 

When individual 1 is speaking we expect the audio to be associated with the video for
individual 1. Similarly, we expect the audio to be associated with the video for individual 2
when he or she is speaking. When neither of the individuals is speaking we expect no
association among the input time-series. Thus, we consider three possible factorizations
being active at any point in time:

\begin{align}
F^1 &= \{\{A^D, V_1^D\}, \{V_2^D\}, \{A^S\}, \{V_1^S\}, \{V_2^S\}\} \tag{5.1} \\
F^2 &= \{\{A^D, V_2^D\}, \{V_1^D\}, \{A^S\}, \{V_1^S\}, \{V_2^S\}\} \tag{5.2} \\
F^3 &= \{\{A^D\}, \{V_1^D\}, \{V_2^D\}, \{A^S\}, \{V_1^S\}, \{V_2^S\}\} \tag{5.3}
\end{align}

Figure 5.6. The Three Audio and Video Factorizations Considered: (a) $F^1$: person 1 is speaking.
(b) $F^2$: person 2 is speaking. (c) $F^3$: neither person is speaking. The factorizations are shown as
association graphs. When person 1 is speaking we assume the feature time-series factorize as shown
in (a). When person 2 is speaking we assume (b). When neither person is speaking we assume all
observed time-series are independent.

Figure 5.6 shows association graphs for these three factorizations. Note that the structural
differences between these three factorizations are only in the dynamic features. The
implicit assumption is that the dependence information is primarily captured by the
dynamics of the audio-visual speech process, while static features primarily exhibit
appearance changes (i.e., parametric changes) rather than dependence changes. As discussed
in Chapter 4, while a windowed approach to analysis can only take advantage
of the structural differences, a dynamic dependence model can use these static features
to help distinguish which is the correct factorization.

■ 5.4 AV Association Experimental Results

We perform dynamic dependence analysis on these datasets using three different approaches.
The first approach performs windowed analysis comparing FactMs using the
three possible factorizations defined in Equations 5.1, 5.2 and 5.3. We use the abbreviation
WFT to denote this windowed factorization test. For the WFT, at each
frame, the likelihood ratios $l_{1,2}$, $l_{1,3}$ and $l_{2,3}$ are calculated using a window of samples
centered around that frame. Note that, by virtue of an additional edge, both $F^1$ and
$F^2$ are more expressive than the fully independent $F^3$. Thus, additionally, p-values for
$l_{1,3}$ and $l_{2,3}$ are calculated via 100 permutations. If $l_{1,2}$ is positive (negative) then we
eliminate $F^2$ ($F^1$) as a possible hypothesis and use the p-value for $l_{1,3}$ ($l_{2,3}$) to choose
between $F^1$ ($F^2$) and $F^3$. That is, we first determine which video stream is most likely
associated with the audio by looking at $l_{1,2}$. Given the more likely of the two, we then
look at the p-value for that hypothesis when compared to no association, $F^3$, to decide
if we should reject $F^3$. Window sizes of 8, 15, 30, 60, 90 and 120 frames with various
p-value thresholds are tested. For each sequence, the final result recorded is the one
with the window size and p-value threshold that gives the best performance for that
particular sequence. While selection of the settings which yield the best performance
is not reasonable in practice, our purpose is to use these results as a best-case baseline
for comparison to the dynamic approach.
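
The decision rule of the windowed test can be sketched for a single window as follows. This is an illustrative simplification, not the implementation used for the reported results. It assumes the quantized samples in a window are treated as i.i.d. draws with maximum-likelihood multinomial estimates, in which case the log-likelihood ratio between a factorization that links two streams and one that separates them reduces to the number of samples times their empirical mutual information. Only the dynamic streams enter because the static factors are identical across the three factorizations. The p-value convention (count + 1)/(permutations + 1), the threshold value and all function names are assumptions.

```python
import math
import random
from collections import Counter


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def log_likelihood_ratio(a, v):
    """ML log-likelihood ratio of 'a and v dependent' versus 'a and v independent'."""
    return len(a) * mutual_information(a, v)


def permutation_p_value(a, v, num_perm=100, seed=0):
    """Fraction of shuffles of v whose ratio is at least the observed one."""
    rng = random.Random(seed)
    observed = log_likelihood_ratio(a, v)
    shuffled = list(v)
    count = 0
    for _ in range(num_perm):
        rng.shuffle(shuffled)
        if log_likelihood_ratio(a, shuffled) >= observed:
            count += 1
    return (count + 1) / (num_perm + 1)


def wft_decision(a, v1, v2, threshold=0.05):
    """Return 1, 2 or 3 for the factorization chosen in this window."""
    l13 = log_likelihood_ratio(a, v1)  # F1 versus F3
    l23 = log_likelihood_ratio(a, v2)  # F2 versus F3
    l12 = l13 - l23                    # F1 versus F2
    if l12 > 0:
        return 1 if permutation_p_value(a, v1) <= threshold else 3
    return 2 if permutation_p_value(a, v2) <= threshold else 3


# Toy 30-frame window: the audio symbols match one video stream exactly,
# while the other video stream is constant.
audio = [0, 1] * 15
constant = [0] * 30
choice_person1 = wft_decision(audio, list(audio), constant)
choice_person2 = wft_decision(audio, constant, list(audio))
choice_neither = wft_decision(audio, constant, constant)
```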

The second approach uses an HFactMM(0,3) with each state linked to a single
factorization. That is, state i indexes the factorization $F^i$, allowing the semantic label
of “Who is speaking” to be directly read from the state sequence inferred by the model.
Similarly, the third approach uses a FactMM(0,3), which removes the Markov structure
on the state sequence. We perform ML inference using both of these models to get the
most likely state sequence after learning parameters via EM. For EM, 100 random
initializations are used with a maximum of 80 iterations each. The most likely parameters
are kept. In most cases EM converges before 40 iterations and the maximum is found
by multiple initializations.

■  5.4.1  CUAVE 

All three approaches were applied to all 22 sequences in the CUAVE corpus. Table 5.1
shows a summary of the average performance for each method. Performance is reported
in terms of the percentage of frames correctly classified according to the ground truth
provided by [6]. The first row of Table 5.1 shows that all the techniques give around
80% accuracy. Typically, the maximum average performance of the WFT was obtained
with a window length of 30 frames, approximately one second. This shows that, with
some training data to set window length and p-value thresholds, the WFT method
would do well with these features. However, these results are somewhat misleading, as
we will explain.

                       HFactMM   FactMM   Best WFT   Nock and Iyengar [68]   Gurban and Thiran [40]
Mean Accuracy (%)        80.24    78.51      83.86                      75                     87.4
Mean Accuracy C (%)      88.11    86.38      83.42                      NA                       NA

Table 5.1. Results Summary for CUAVE: The Best WFT accuracy corresponds to the WFT with
settings that maximized the average performance for the entire dataset. C = silence constraint imposed.

Figure 5.7. Results for CUAVE Sequence g09: The top row indicates the ground truth labeling of
who is speaking. White = neither person is speaking, light red = person 1, dark blue = person 2. The
following rows display the output of dynamic dependence analysis using an HFactMM, a FactMM, and
a windowed approach (WFT). The bottom two rows show post-processed results for the HFactMM and
FactMM in which short (< 25 frames) segments of silence/neither person speaking are removed to be
consistent with the ground truth labeling procedure. The accuracy of each result is shown below the
labeling.

Figure 5.7 shows the estimated labels for a typical sequence in the corpus (g09). The
top line shows the ground truth labeling. The next two are the outputs of the HFactMM
and FactMM. Notice that these methods disagree with the ground truth by consistently
putting non-speaking (fully independent) blocks between speaker transitions and within
speaking blocks. Examination of these sections in the original video reveals that they are
actually short periods of silence. Consequently, the HFactMM and FactMM correctly
labeled these sections. The WFT does not exhibit this behavior and smooths over the
short silence regions.

This  disagreement  with  the  provided  ground  truth  is  an  artifact  of  how  the  ground 
truth  was  defined.  In  [6],  it  is  stated  that: 

A group of frames is labeled as silent when it is composed of at least 25 frames.

That  is,  periods  of  silence  less  than  25  frames  within  speech  are  considered  part  of 
the  speech  according  to  the  ground  truth.  While  one  might  question  this  constraint, 
we can easily incorporate it into our analysis to gain a fairer comparison. We impose
the constraint by post-processing the outputs to remove any periods of labeled silence
($z_t = 3$) less than 25 frames. The constrained outputs are shown in the last two lines
of  Figure  5.7.  With  this  constraint  the  HFactMM  and  FactMM  outperform  all  other 
methods,  improving  to  88%  and  86%  respectively  as  shown  in  Table  5.1.  To  the  best  of 
our  knowledge  these  results  are  equivalent  to  or  better  than  all  other  reported  results  for 
audio-visual  speaker  association  on  the  CUAVE  group  set.  Nock  and  Iyengar  [68]  report 
75%  accuracy  with  a  windowed  Gaussian  MI  measure  and  Gurban  and  Thiran  [40]  get 
87.4%  with  a  trained,  audio-visual  speech  detector.  Both  methods  use  a  silence/speech 
detector  and  only  perform  a  dependence  test  when  there  is  speech.  Our  method  obtains 
better  performance  without  the  benefit  of  separate  training  data  or  a  silence  detector. 
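
The silence constraint and the accuracy measure can be sketched as follows. The text states only that labeled silence shorter than 25 frames is removed. Relabeling such a run with the label that precedes it (or the label that follows it, for a run at the very start) is an assumption of this sketch, and the function names are hypothetical.

```python
from itertools import groupby

SILENCE = 3  # state index of the fully independent factorization


def remove_short_silences(labels, min_len=25):
    """Relabel runs of SILENCE shorter than min_len with a neighboring label."""
    runs = [(lab, len(list(grp))) for lab, grp in groupby(labels)]
    out = []
    for i, (lab, length) in enumerate(runs):
        if lab == SILENCE and length < min_len:
            if out:
                lab = out[-1]         # assumption: inherit the preceding label
            elif i + 1 < len(runs):
                lab = runs[i + 1][0]  # leading run: inherit the following label
        out.extend([lab] * length)
    return out


def frame_accuracy(predicted, truth):
    """Percentage of frames whose label matches the ground truth."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(truth)


# Toy example: a 10-frame pause inside person 1's turn, then a 30-frame silence.
truth = [1] * 40 + [3] * 30 + [2] * 30
raw = [1] * 15 + [3] * 10 + [1] * 15 + [3] * 30 + [2] * 30
fixed = remove_short_silences(raw)
accuracy_raw = frame_accuracy(raw, truth)
accuracy_fixed = frame_accuracy(fixed, truth)
```

In this toy example the short pause is absorbed into the surrounding speech while the 30-frame silence is kept, so the constrained output matches the ground truth convention.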

Full  detailed  results  for  the  CUAVE  Corpus  are  found  in  Table  5.2. 

Seq    HFactMM  FactMM  HFactMM C  FactMM C  WFT 8  WFT 15  WFT 30  WFT 60  WFT 90  WFT 120
g01      84.77   80.19      94.61     93.67  69.54   77.49   80.73   80.19   80.19    80.19
g02      84.42   80.65      92.09     89.57  65.08   80.28   78.27   71.48   74.50    74.37
g03      87.37   78.26      92.75     93.89  67.49   79.92   80.43   74.53   75.78    72.05
g04      88.06   86.91      96.06     93.63  73.00   87.60   88.88   90.15   87.14    86.67
g05      87.58   83.30      95.38     90.99  59.67   68.57   76.70   82.20   83.41    76.04
g06      68.49   65.76      68.17     68.33  71.22   79.58   84.41   85.21   82.15    76.21
g07      83.78   82.16      96.49     93.78  70.81   83.24   89.73   90.54   89.19    92.16
g08      80.95   75.73      92.46     84.82  71.28   77.56   80.56   82.40   82.79    80.66
g09      80.62   75.28      94.78     82.11  74.16   88.32   86.09   77.52   75.90    81.86
g10      78.14   72.78      80.41     74.02  69.28   81.03   80.41   84.12   81.44    80.62
g11      74.76   79.47      83.80     83.99  83.24   84.56   89.83   93.41   89.27    86.63
g12      61.91   64.20      69.85     72.81  59.08   67.70   78.47   75.10   71.33    70.79
g13      80.31   76.31      88.85     86.24  72.82   83.28   86.59   83.97   85.37    84.32
g14      83.25   78.93      93.19     82.59  71.34   79.84   80.50   80.50   79.97    80.24
g15      85.92   84.02      93.11     90.47  70.82   80.35   82.70   81.23   85.34    82.11
g16      69.86   77.40      71.78     84.52  70.96   85.75   85.89   85.21   83.97    82.74
g17      85.36   81.66      93.65     93.12  90.48   90.12   95.41   93.65   90.83    88.01
g18      64.23   82.37      72.06     87.84  51.86   59.28   69.48   72.16   77.53    74.64
g19      89.16   83.05      85.34     85.80  58.32   75.27   77.86   76.95   71.30    68.85
g20      84.81   80.11      97.01     90.80  78.07   90.16   91.23   88.56   86.20    85.13
g21      82.58   81.52      92.42     88.70  77.39   85.90   85.51   84.18   80.98    83.11
g22      78.81   77.13      94.21     88.57  80.18   85.21   88.26   89.79   86.74    83.08
Mean     80.24   78.51      88.11     86.38  70.73   80.50   83.54   82.87   81.88    80.48

Table 5.2. Full Results on CUAVE Group Dataset: C = silence-constraint-imposed outputs
(non-speaking segments shorter than 25 frames removed). WFT # indicates a windowed factorization
test with a window length of # frames. Note that the WFT results are obtained by finding a p-value
threshold to give the maximum mean accuracy over the entire dataset. The best result for each
sequence is highlighted in bold.

■ 5.4.2 Look Who’s Talking Sequence

In the CUAVE database, much of the information about who is speaking comes from
the changes in dependence structure between the audio and the video. This is evident
in the fact that a windowed approach performs reasonably well, yielding 83% accuracy.
Appearance changes in the CUAVE dataset are primarily a result of differences in the
voice characteristics of each individual and the fact that each individual only moves
his or her mouth when speaking.² In our second dataset there is also a significant
appearance change when there is a transition of who is speaking. When one person
is speaking the other subject changes his or her gaze. The results for this sequence
are shown in Figure 5.8. Both the HFactMM and FactMM greatly outperform the
WFT as predicted in our theoretical analysis. The poor results of the WFT show that
there is not sufficient dependence information in the features at all times. However,
the HFactMM and FactMM take advantage of the static appearance differences (in this
case head pose) to help group the data and correctly label the video.

■  5.4.3  NIST  Sequence 

Lastly,  we  explore  performance  on  the  NIST  data.  Note  that,  in  this  dataset,  when 
neither  tracked  person  is  speaking  there  is  often  an  individual  off-camera  who  is  speaking 
rather  than  simple  silence.  Consequently,  the  independent  factorization,  F3,  must 

2 Note that this is not entirely true. In the CUAVE dataset, most people continuously move their
mouths in some way while anticipating their turn to speak.




[Figure 5.8 label rows, top to bottom: Ground Truth, HFactMM, FactMM, WFT]

Figure 5.8. Results for the “Look Who’s Talking” Data: White = neither person is speaking, light
red = person 1, dark blue = person 2. The top row shows the ground truth. Results are shown for the
HFactMM, a FactMM and a windowed test. The accuracy of each result is shown below the labeling.
Note that here the HFactMM and FactMM do significantly better than the windowed approach. This is
because they can cluster data from far-away time points when making a local decision.


model more than just silence. Figure 5.9 shows the results of all three approaches when
applied to this data. The HFactMM again achieves the best performance, but obtains only
76% accuracy when compared to the ground truth. We see that the best choice
of parameters for the WFT favors the independent factorization. This, again, indicates
that there is little dependence information in the features and that much is gained by
using a dynamic model.

This data also reveals that the problem of audio-visual association is much more
challenging than the controlled CUAVE dataset indicates. Note that we purposely
chose simple features and focused on demonstrating the advantage of using a dynamic
dependence model when treating this problem as a generic dynamic dependence analysis
task. We expect one can improve performance by incorporating better face tracking
and more robust audio and video features.

■  5.4.4  Testing  Underlying  Assumptions 

In  the  previous  sections  we  explored  the  use  of  a  dynamic  dependence  model  on  three 
audio-visual  speaker  association  datasets.  We  focused  on  comparing  the  results  obtained 
using  an  HFactMM  with  those  obtained  using  a  windowed  approach.  Performance  was 
discussed  in  the  context  of  the  overall  “difficulty”  of  each  dataset.  Difficulty  was  loosely 
defined  in  terms  of  how  well  each  technique  performed,  the  type  of  speech  activity  and 
the amount of audio-visual appearance change associated with changes of speaker in the
data.  Here,  we  provide  some  additional  analysis.  We  explore  how  well  the  underlying 
Figure 5.9. Results for NIST data: White = neither person is speaking, light red = person 1, dark
blue = person 2. The top row shows the ground truth. Results are shown for the HFactMM, a FactMM
and a windowed test. The accuracy of each result is shown below the labeling. Again, here the HFactMM
and FactMM significantly outperform the windowed approach, which sees very little dependence in the
data. The dynamic models overcome this by pooling data over all time.


assumptions  made  when  mapping  the  audio-visual  association  problem  to  a  dynamic 
dependence  analysis  task  hold  for  each  dataset. 

The main underlying assumption we make in treating the speaker association problem
as a dynamic dependence analysis task is that the dependence structure among the
observations is tied to who is speaking. That is, when person i is speaking we assume
video i is dependent on the audio and independent of all other observations. We
also assume there is no dependence among the observations when none of the observed
individuals are speaking.

We test this assumption on all datasets by measuring the mutual information between
the audio and video for each person over various window sizes. That is, we measure
$I(A; V_i)$ for each $i \in \{1, 2\}$ within a specified window of time. As discussed in Chapter
3, mutual information is a natural choice of statistic to measure due to its relationship
to the likelihood of association under the factorization model. Here, we simplify our
notation such that the audio, $A$, within a window $t_w$ is the set of both static and dynamic
audio features in that window, and similarly for the video $V_i$.
Recall that these audio and video features were computed over the entire sequence. In
particular, this means that the codebook has to represent both speakers in the case of
the audio measurements.
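
The windowed measurement described above can be sketched as follows. This is a minimal illustration, not the estimator used in the experiments: it assumes the audio and video features have already been quantized to discrete codeword indices, uses a simple plug-in (histogram) estimate, and uses non-overlapping windows. The function names are hypothetical.

```python
import numpy as np

def mutual_information_bits(a, v):
    """Plug-in estimate of I(A; V) in bits from paired discrete samples."""
    a = np.asarray(a)
    v = np.asarray(v)
    # Map arbitrary labels to contiguous indices and build the joint histogram.
    _, a_idx = np.unique(a, return_inverse=True)
    _, v_idx = np.unique(v, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, v_idx.max() + 1))
    np.add.at(joint, (a_idx, v_idx), 1.0)
    joint /= joint.sum()
    # Marginals as a column and a row so their product is the independent model.
    pa = joint.sum(axis=1, keepdims=True)
    pv = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pv)[nz])))

def windowed_mi(a, v, window):
    """Mean and standard deviation of I(A; V) over non-overlapping windows."""
    n = len(a) // window
    vals = [
        mutual_information_bits(a[k * window:(k + 1) * window],
                                v[k * window:(k + 1) * window])
        for k in range(n)
    ]
    return float(np.mean(vals)), float(np.std(vals))
```

Plug-in estimates of this kind are biased upward when a window holds few samples, so short windows can report more dependence than is really present.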

Figure 5.10 shows a summary of this analysis for a single sequence in the CUAVE
dataset. The summary is broken down into three rows of plots, each of which corresponds
to a state of who is speaking. Each plot summarizes the measured mutual
information for each person, $I(A; V_1)$ and $I(A; V_2)$, over various window lengths given
a particular state. The x axis indicates the window length. For each window length
we plot the mean and standard deviation of the mutual information measured within all
windows of that length as a bar graph with error bars. Each plot also
shows two horizontal lines, one for each person, which display the mutual information
measured using all of the data within the specified state.

Examining Figure 5.10 we see that, for this CUAVE sequence, when neither person
is speaking our assumptions hold: there is no dependence between the audio and the
video of either person, over all such times or within windows. The second and third
rows show that $I(A; V_i)$ is highest when person $i$ is speaking. However, they also show
that there is some measured dependence between the audio and the person who is not
speaking. While within a window this measured dependence may be spurious or a
result of using a small amount of data, we see that when using all the data (the horizontal
lines) the dependence is still present. The results for different window lengths show
that the variance of the measured dependence decreases with window length and that smaller
window lengths tend to have higher mutual information. These results also show that,
in this sequence, it is the difference in dependence that matters, not the presence of
dependence.

Figure 5.11 shows results for the “Look Who’s Talking” data. We see trends similar
to those of the CUAVE data. However, there is some measured dependence when
neither person is speaking, the measured dependence is lower over all states, and the
difference in mutual information for each speaker is smaller. That is, the analysis seems to
indicate there is less dependence information in this data. Results for the NIST data
are shown in Figure 5.12. We see that while the average dependence within windows
for all states is generally higher than in the other two sets of data, there is more variance
and the mutual information measured using all the data within a state is actually
lower than that in CUAVE and the “Look Who’s Talking” data for each state. Such
characteristics can help explain why we obtained slightly worse performance on the
NIST sequence relative to the others.

Next we explore how well we can distinguish the state of who is speaking from the
overall distribution of the data. While the previous analysis looked at dependence
information, here we wish to analyze both parametric and dependence differences among the

Figure 5.10. Dependence Analysis for CUAVE Sequence g09: Each row is a separate state. The x
axis is the window length used. For each window length two bar graphs show the average and standard
deviation of $I(A; V_1)$ (red) and $I(A; V_2)$ (blue) measured using all windows of the specified length within
the specified state. The horizontal lines show $I(A; V_1)$ (red) and $I(A; V_2)$ measured using all the data
within the given state.

distribution of observations for each state. In our model we assume that the distribution
of observations changes depending on who is speaking. We test this assumption by
measuring the mutual information of the state $z$ with the data $\mathcal{D}$. We also look at $I(z; A)$,
$I(z; V_1)$ and $I(z; V_2)$ to help understand which observations are most informative about
the state. Figures 5.13(a), 5.13(b), and 5.13(c) show the results of this analysis for each
[Figure 5.11 panel title: Look Who's Talking: Given person 2 is speaking]

Figure 5.11. Dependence Analysis for the “Look Who’s Talking” Sequence: Each row is a separate
state. The x axis is the window length used. For each window length two bar graphs show the average
and standard deviation of $I(A; V_1)$ (red) and $I(A; V_2)$ (blue) measured using all windows of the specified
length within the specified state. The horizontal lines show $I(A; V_1)$ (red) and $I(A; V_2)$ measured using
all the data within the given state.


dataset. The bars display the measured mutual information, while the horizontal line
is the entropy of the state label $H(z)$, which serves as an upper bound. Figure 5.13(a)
shows that for a CUAVE sequence the mutual information is close to the upper bound
and that most of the information comes from the video observations rather than the audio.
Figure 5.13(b) shows a similar trend for the “Look Who’s Talking” dataset. However,
[Figure 5.12 panel title: NIST: Given person 1 is speaking. Legend: Person 1 (solid), Person 2
(dashed). x axis: Window Length (frames).]

Figure 5.12. Dependence Analysis for the NIST Sequence: Each row is a separate state. The x axis
is the window length used. For each window length two bar graphs show the average and standard
deviation of $I(A; V_1)$ (red) and $I(A; V_2)$ (blue) measured using all windows of the specified length within
the specified state. The horizontal lines show $I(A; V_1)$ (red) and $I(A; V_2)$ measured using all the data
within the given state.


the  mutual  information  is  higher  and  closer  to  the  upper  bound.  Lastly,  Figure  5.13(c) 
shows  that  for  the  NIST  data  the  state  is  less  distinguishable  than  in  the  other  datasets. 
However,  it  is  important  to  again  note  that  this  analysis  is  dependent  on  our  feature 
choices.  Perhaps  features  other  than  MFCCs  might  result  in  stronger  dependence  in 
the  audio  features. 
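
For reference, the upper bound used in this analysis, and the reason the single-modality bars cannot exceed the bar for all observations, follow from standard identities. The decomposition $\mathcal{D} = \{A, V_1, V_2\}$ is assumed here from the description of the observations:

```latex
\begin{align*}
I(z;\mathcal{D}) &= H(z) - H(z \mid \mathcal{D}) \;\le\; H(z),
  && \text{since } H(z \mid \mathcal{D}) \ge 0 \text{ for a discrete label } z, \\
I(z;\mathcal{D}) &= I(z;A) + I(z;V_1,V_2 \mid A) \;\ge\; I(z;A),
  && \text{with } \mathcal{D} = \{A, V_1, V_2\}.
\end{align*}
```

The same chain-rule argument applies with $V_1$ or $V_2$ in place of $A$.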
[Figure 5.13 panels: (a) CUAVE g09, (b) Look Who’s Talking, (c) NIST. Bars in each panel:
I(z;D), I(z;A), I(z;V1), I(z;V2).]

Figure 5.13. Analysis of State Distinguishability: Each subfigure shows the analysis for a different
dataset. The bars show the mutual information between the state label $z$ and all observations $\mathcal{D}$ or
just the audio $A$ or video $V_1$/$V_2$. The horizontal line is the entropy of the state label $H(z)$, which is an
upper bound on the mutual information.
In our experiments, we use an HFactMM with a single state for each dependence
structure, implicitly assuming the model is stationary when that state is active. The
windowed method is largely insensitive to the model varying so long as the dependence
is measurable. Next, we test this stationarity assumption by looking at two different
windows of data that both occur within a specified state. We measure the mutual
information between the data and the label of which window the data came from.
Low mutual information indicates that the statistics are similar in both windows and
that our assumption of stationarity may be true. Figures 5.14, 5.15 and 5.16 show our
analysis results. Note that the upper bound on mutual information is 1 bit, assuming
each window is equally likely. We plot the average mutual information and variance
as bar plots for multiple window lengths for each state. We break up the analysis into
comparing two windows that occur within the same utterance (the state label does not
change at any time between the two windows) versus those which occur in different
utterances.
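
A minimal sketch of this stationarity measurement is given below. It assumes the data in each window are discrete symbols (for example, codeword indices) and that the two windows are equally likely, in which case the mutual information between the window label and the data is the entropy of the mixture minus the average entropy of the two windows. The function names are hypothetical, and the plug-in estimate is an assumption rather than the estimator used in the experiments.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def window_label_mi_bits(win1, win2, num_symbols):
    """I(W; D) in bits for two equally likely windows of discrete symbols."""
    p1 = np.bincount(win1, minlength=num_symbols) / len(win1)
    p2 = np.bincount(win2, minlength=num_symbols) / len(win2)
    mix = 0.5 * (p1 + p2)
    # I(W; D) = H(D) - H(D | W), with H(D | W) the average within-window entropy.
    return entropy_bits(mix) - 0.5 * entropy_bits(p1) - 0.5 * entropy_bits(p2)
```

Two windows with identical symbol histograms give 0 bits, and two windows with disjoint symbols give exactly 1 bit, which is the upper bound noted above.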

Figures 5.14, 5.15 and 5.16 show that, over all datasets and states, within-utterance
stationarity improves with larger window sizes. This trend is most evident in the
CUAVE and “Look Who’s Talking” data, and much less so in the NIST data. That is, even
within the same utterance the distribution varies considerably in the NIST data. Windows
from different utterances seem to be very distinct in general. This is somewhat unexpected,
as one would expect more of a downward trend with larger window lengths. Even
though the utterances are different, one would expect the visual appearance and audio
characteristics to be similar.

We now summarize the findings of the analysis presented in this section.
First, our assumptions about changing dependence structure more or less hold in that the
video for the speaker tends to be most dependent on the audio. This was first strongly
hinted at by the fact that we achieved state-of-the-art performance on each of these datasets.
Ranked by how closely they agree with this assumption, the datasets are ordered:
CUAVE, “Look Who’s Talking”, NIST. Second, our assumption of each state having
a distinct distribution seems to hold true as well. Ranking the datasets based on
this assumption we order them as: “Look Who’s Talking”, CUAVE, NIST. This is
consistent with the fact that we achieved a large improvement in performance by using
an HFactMM rather than a windowed approach on the “Look Who’s Talking” data.
Lastly, our assumption of stationarity seems to be violated in each data sequence we
explored. This raises the question of whether this is a poor assumption or whether the
Figure 5.14. Stationarity Analysis for CUAVE Sequence g09: Each row is a separate state. The x
axis is the window length used. For each window length, two windows are extracted within the specified
state. For each pair of windows the mutual information $I(W; \mathcal{D})$ in bits is measured, where $W$ is the
label indicating which window the data came from. We summarize the results for each window in two
bar plots. The first (green) gives the mean and variance of the mutual information when the windows
are within the same utterance. The second (light blue) is the same for when the windows are in different
utterances. The blank bars for window lengths above 60 frames in the first plot are due to lack of data
meeting those criteria.
[Figure 5.15 panel titles: Look Who’s Talking: Person 1 Speaking; Look Who’s Talking: Person 2
Speaking. x axis: Window Length (frames), at 15, 30, 60, 90, 120. y axis: 0 to 1. Legend: Same
Utterance, Different Utterances.]

Figure 5.15. Stationarity Analysis for the “Look Who’s Talking” Sequence: Each row is a separate
state. The x axis is the window length used. For each window length, two windows are extracted within
the specified state. For each pair of windows the mutual information $I(W; \mathcal{D})$ in bits is measured, where
$W$ is the label indicating which window the data came from. We summarize the results for each window
in two bar plots. The first (green) gives the mean and variance of the mutual information when the
windows are within the same utterance. The second (light blue) is the same for when the windows are
in different utterances. The blank bar for different utterances of window length 120 is due to lack of
data meeting those criteria.
Figure 5.16. Stationarity Analysis for the NIST Sequence: Each row is a separate state. The x axis is
the window length used. For each window length, two windows are extracted within the specified state.
For each pair of windows the mutual information $I(W; \mathcal{D})$ in bits is measured, where $W$ is the label
indicating which window the data came from. We summarize the results for each window in two bar
plots. The first (green) gives the mean and variance of the mutual information when the windows are
within the same utterance. The second (light blue) is the same for when the windows are in different
utterances.

lack of stationarity is a result of the features we chose or the particular data we looked at.
Either way, it suggests some future avenues to explore when analyzing and/or extending
the models proposed in this dissertation.

■  5.5  Summary 

In this chapter, we looked at the problem of audio-visual speaker association. We cast
the problem in terms of dynamic dependence analysis using an HFactMM to reason
over factorizations. Consistent with the theoretical analysis provided in Chapter 4,
we have empirically shown that by modeling dependence as a dynamic process the
HFactMM can exploit both structural and parameter differences to distinguish between
hypothesized states of dependence. This is in contrast to sliding window methods, which
can only discriminate based on structural differences. State-of-the-art performance was
obtained on a standard dataset for audio-visual association. Significantly, this was
achieved without the benefit of training data.
Chapter 6

Application: Object Interaction Analysis


In  this  chapter,  we  use  dynamic  dependence  models  for  the  purpose  of  analyzing  the 
interactions  among  multiple  moving  objects.  Such  analysis  can  provide  information 
and  insight  into  coupled  behavior  for  automated  surveillance  and  tracking  of  multiple 
objects  [85,  20].  Much  of  the  past  work  in  this  area  has  focused  on  modeling  and 
classifying  object  activity  [39,  2,  14],  specific  single  object  behaviors  [15,  9,  71]  and 
anomalous  events  [12,  66].  In  contrast,  we  focus  on  the  unsupervised  identification  of 
interactions  among  multiple  moving  objects. 

Our approach is closely related to the recent work of Tieu [88], who has formulated
the problem of detecting and describing interaction explicitly in terms of statistical
dependence and model selection. That is, interactions are defined in terms of dependence
structure rather than specific behavior models learned from training data. Representing
object trajectories as time-series, [88] assumes a single dependence structure among
them over all time. The strength of interactions is described in terms of statistical
measures of dependence. While similar to this preliminary proof of concept, our
approach differs in that, here, we are primarily concerned with characterizing a full
posterior distribution over interactions described by the structure of a graphical model.
More importantly, using the dynamic dependence models presented in Chapter 4, we
efficiently reason over a larger set of interaction structures which are allowed to evolve
over time.

We  begin  in  Section  6.1  by  casting  the  problem  of  reasoning  over  object  interactions 
as  one  of  Bayesian  inference  over  the  structure  of  a  temporal  interaction  model  (TIM). 
We  discuss  how  to  parameterize  a  TIM  and  provide  an  illustrative  example  using  data 
generated  from  the  model.  A  series  of  experiments  on  various  datasets  of  moving  objects 
[Figure 6.1 panels, each an interaction graph with two example descriptions:
(a) 1 and 2 are moving independently; 1 and 2 are not interacting
(b) 1 and 2 are influencing each other; 1 and 2 are moving together
(c) 1 is influencing 2; 2 is following 1
(d) 2 is influencing 1; 1 is following 2]

Figure 6.1. Interaction Graphs for Two Objects: Four possible directed interaction graphs describing
the behavior of two objects. Under each graph are two descriptions of object interactions that would
yield the structure above.

are  conducted  and  analyzed  in  Section  6.2. 

■  6.1  Modeling  Object  Interaction  with  a  TIM(r) 

Consider $N$ objects moving within an environment. We assume we are given the
locations of all objects over a common, regularly sampled time interval, 1 to $T$.
We denote the trajectory of the $i$-th object as the time series $x^i_{1:T}$, which describes
the kinematic state1 of that object at each point in time. We map the problem of
identifying interactions among these objects to a dynamic dependence analysis task
using these observed time-series. Specifically, we use an $r$-th order temporal interaction
model, TIM(r), whose directed structure $E$ describes the causal relationships among
the positions of the $N$ objects under analysis.

For example, consider the case in which there are only two objects. Figure 6.1
shows the four possible directed structures $E$ depicted as interaction graphs. Underneath
each interaction graph are two possible semantic descriptions of interactions that may
yield the depicted structure above. However, one must be careful not to invert the
relationship and claim a semantic description can be read from the interaction graph
alone. While generic terms such as “influence” may be appropriately read from the
interaction graph, other terms such as “follow” cannot. That is, a directed edge from


1In  our  experiments,  kinematic  state  is  simply  the  2-D  position  of  the  object. 
1 to 2 indicates that the observation for object 2 at the current time is dependent on
or  “influenced”  by  the  past  of  object  1.  Terms  such  as  “follow”  are  more  complex  and 
involve  describing  the  nature  of  this  particular  influence.  That  is,  the  label  “2  follows 
1”  is  usually  interpreted  to  mean  object  2  is  physically  behind  object  1.  Such  a  label 
cannot  be  read  from  the  dependence  graph  in  Figure  6.1(c).  The  structure  depicted 
can  also  be  associated  with  an  interaction  involving  “2  stays  ahead  of  1”.  The  missing 
information  telling  us  which  description  is  more  appropriate  is  captured  by  the  value 
of  the  parameters  associated  with  the  edge  from  1  to  2  (which  we  treat  as  a  nuisance). 

In other words, $E$, as depicted in the interaction graph, only describes the structure
of the interaction, while the parameters associated with that structure describe the
nature of the interaction. In this chapter, we are primarily concerned with quantifying
uncertainty in the structure of the interaction given observed data. However, as
discussed in Chapter 4, the parameters describing the nature of object interactions will
help to pool information when reasoning over interactions that may change over time.
If one is interested in characterizing the nature of interaction, priors can be chosen for the
parameters which favor certain behaviors and then the posterior on these parameters
can be calculated given the observed data as discussed in Section 3.4.3.

■  6.1.1  Parameterization 

Before we can use a TIM(r) to infer interactions, we need to supply its parameterization.
That is, we need to choose a parametric form for $p(x_t \mid \tilde{x}_t, E, \Theta)$ that can describe
trajectories of objects. Recall from Section 3.4 that a TIM factorizes as

$$p(x_t \mid \tilde{x}_t, E, \Theta) = \prod_{v=1}^{N} p\left(x_t^v \mid \tilde{x}_t^{v,\mathrm{pa}(v)}, \Theta_{v|\mathrm{pa}(v)}\right). \tag{6.1}$$

Thus, ultimately we must describe the form of $p(x_t^v \mid \tilde{x}_t^{v,S}, \Theta_{v|S})$ that models the
dependence of a time-series $v$ given its past in addition to the past of a parent set $S$.

We adopt a simple $r$-th order vector auto-regressive model, VAR(r), for these
conditional relationships [70]. While other models are possible, this particular choice is
useful as it leads to a tractable conjugate prior. That is, assuming $x_t^i$ describes the
$d$-dimensional position of object $i$ at time $t$ ($x_t^i \in \mathbb{R}^d$), we adopt the following linear
model

$$x_t^v = A_{v|S}\, \tilde{x}_t^{v,S} + w, \tag{6.2}$$

where $A_{v|S}$ is a $d \times d(1 + |S|)r$ dynamic matrix and $w$ is zero-mean Gaussian noise with
$d \times d$ covariance $\Lambda_{v|S}$. The parameter $\Theta_{v|S} = \{A_{v|S}, \Lambda_{v|S}\}$. Recall that $\tilde{x}_t^{v,S}$ contains
the past $r$ values of time-series $v$ in addition to the past values of all time-series in set
$S$. The VAR(r) model describes the current value of time-series $v$ as a linear function
of these past values plus Gaussian noise. Here, the parameter $A_{v|S}$ captures the nature
or form of dependence.

Consider a simple case with $r = 1$ in which time-series $v$ has a single parent $u$. In this
situation, Equation 6.2 is

$$x_t^v = A_{v|u} \begin{bmatrix} x_{t-1}^v \\ x_{t-1}^u \end{bmatrix} + w, \tag{6.3}$$

which is a simple VAR(1) model. Higher order models ($r > 1$) can be used to capture
higher-order dependence on motion. For example, a VAR(2) can be used to capture
dependence on velocity when $x_t^v$ represents position.
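
To make the velocity remark concrete, consider the following worked instance. The particular coefficient matrix, the absence of parents, and the ordering of the stacked past values are illustrative assumptions rather than settings taken from the experiments:

```latex
% Assumed example: VAR(2) with no parents (S = \emptyset), d-dimensional position,
% and past values stacked as [x_{t-1}; x_{t-2}].
\begin{align*}
x_t^v &= \begin{bmatrix} 2I_d & -I_d \end{bmatrix}
         \begin{bmatrix} x_{t-1}^v \\ x_{t-2}^v \end{bmatrix} + w \\
      &= x_{t-1}^v + \underbrace{\left(x_{t-1}^v - x_{t-2}^v\right)}_{\text{previous velocity}} + w .
\end{align*}
```

With this choice the object continues at its previous velocity up to noise, which a first-order model on position alone cannot express.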

In this chapter, we will focus on Bayesian inference of structure while integrating
out parameters. We wish to define a prior on these parameters so that one can efficiently
integrate over them. Equation 6.2 is a linear regression model. As mentioned in
Chapter 2, a conjugate prior for the parameters of this model is the
matrix-normal-inverse-Wishart distribution:


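$$p_0(\Theta_{v|S}) = p_0(A_{v|S} \mid \Lambda_{v|S})\, p_0(\Lambda_{v|S}) = \mathcal{MN}(A_{v|S};\, \Omega_{v|S}, \Lambda_{v|S}, K_{v|S})\; \mathcal{IW}(\Lambda_{v|S};\, \Sigma_{v|S}, \nu_{v|S}). \tag{6.4}$$

Thus, posterior updates given observed data $\mathcal{D}$ can be performed using Equation 2.53,
and the evidence $W_{S,v}$ can be calculated using the matrix-T distribution given in
Equation 2.59. This is done by mapping $\mathcal{D}^y = \mathcal{D}^v$, $\mathcal{D}^x = \tilde{\mathcal{D}}^{v,S}$ and the hyper-parameters
$\Omega$, $K$, $\Sigma$, and $\nu$ to their $v|S$ indexed versions in Equation 6.4.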

While all of our experiments will involve one- or two-dimensional position information,
it is straightforward to extend our approach to higher dimensional data.

■  6.1.2  Illustrative  Synthetic  Example  of  Two  Objects  Interacting 

In  this  section,  we  perform  an  illustrative  experiment  on  two  (N  =  2)  one-dimensional 
time-series.  We  wish  to  examine  the  degree  to  which  one  can  distinguish  whether 
a  certain  dependence  relationship  is  present  as  a  function  of  the  number  of  samples 
and strength of that dependence. We design a TIM(1) with a static structure of $x^1$
influencing $x^2$. That is, $E$ contains a single edge, $1 \to 2$, as depicted in Figure 6.1(c).

The time-series $x^1$ is set to be a random walk such that $A_{1|\emptyset} = 1$ and $\Lambda_{1|\emptyset} = 1$.
That is,

$$x_t^1 = x_{t-1}^1 + w^1, \tag{6.5}$$

where $w^1$ is zero-mean, unit-variance Gaussian noise. The amount of influence $x^1$ has
on $x^2$ is controlled via a variable $\rho$ such that $A_{2|1} = [\,(1-\rho) \;\; \rho\,]$ and $\Lambda_{2|1} = 1$. That
is,

$$x_t^2 = (1-\rho)\, x_{t-1}^2 + \rho\, x_{t-1}^1 + w^2, \tag{6.6}$$

where $w^2$ is unit variance. Note that if $\rho = 1$, $x_t^2$ is the position $x_{t-1}^1$ plus noise, and if
$\rho = 0$ it simply follows its own random walk independent of the other time-series.
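
The generative process in Equations 6.5 and 6.6 can be simulated with a few lines of code. The sketch below is illustrative only: the function names, the zero initial state, and the least-squares fit are assumptions added here, and the fit is a simple point estimate rather than the Bayesian posterior computed in the experiment.

```python
import numpy as np

def simulate_pair(rho, T, seed=0):
    """Draw x1 (random walk) and x2 (influenced by x1 with strength rho)."""
    rng = np.random.default_rng(seed)
    x1 = np.zeros(T)
    x2 = np.zeros(T)
    for t in range(1, T):
        # Equation 6.5: random walk with unit-variance Gaussian noise.
        x1[t] = x1[t - 1] + rng.standard_normal()
        # Equation 6.6: mix of own past and parent's past plus noise.
        x2[t] = (1 - rho) * x2[t - 1] + rho * x1[t - 1] + rng.standard_normal()
    return x1, x2

def fit_child_weights(x1, x2):
    """Least-squares estimate of [weight on own past, weight on parent's past]."""
    X = np.column_stack([x2[:-1], x1[:-1]])
    y = x2[1:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

x1, x2 = simulate_pair(rho=0.5, T=5000, seed=1)
print(fit_child_weights(x1, x2))  # estimates of (1 - rho, rho)
```

With many samples the fitted weights should approach $(1-\rho, \rho)$, while with few samples or small $\rho$ the weight on the parent is hard to distinguish from zero, which mirrors the behavior studied below.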

Using a weak matrix-normal-inverse-Wishart prior on the parameters2 and a uniform
prior on structure (all $\beta$ set to 1), we calculated the posterior probability of edge
$1 \to 2$ given a set of samples from our model. This is done for various settings of
$\rho$ and $T$. For each setting, 100 trials are performed and the average posterior edge
probabilities are recorded. Note that, for $N = 2$, $\mathcal{A}_2$ (the set of all structures)
$= \mathcal{P}_1$ (the set of single-parent structures), and the number of structures in each
set is $|\mathcal{T}_2| = 2$ (directed trees), $|\mathcal{F}_2| = 3$ (directed forests) and $|\mathcal{A}_2| = 4$ (all directed
structures).
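
The structure counts quoted above can be checked by brute-force enumeration. The sketch below assumes the following definitions, which are stated here for illustration: a structure is any set of directed edges without self-loops, a directed forest is a structure with at most one parent per node and no directed cycle, and a directed tree is a forest with exactly $N - 1$ edges. All function names are hypothetical.

```python
from itertools import product

def all_structures(n):
    """Every directed edge set on n nodes without self-loops."""
    edges = [(u, v) for u in range(n) for v in range(n) if u != v]
    for mask in product([False, True], repeat=len(edges)):
        yield frozenset(e for e, keep in zip(edges, mask) if keep)

def is_forest(n, E):
    """At most one parent per node and no directed cycle."""
    parent = {}
    for u, v in E:
        if v in parent:
            return False
        parent[v] = u
    # Walk up the parent pointers from every node; revisiting a node means a cycle.
    for start in range(n):
        seen, node = set(), start
        while node in parent:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True

def is_tree(n, E):
    """A forest with a single root, i.e. exactly n - 1 edges."""
    return is_forest(n, E) and len(E) == n - 1

def count_structures(n):
    structs = list(all_structures(n))
    return {
        "trees": sum(is_tree(n, E) for E in structs),
        "forests": sum(is_forest(n, E) for E in structs),
        "all": len(structs),
    }

print(count_structures(2))  # {'trees': 2, 'forests': 3, 'all': 4}
```

Enumeration like this is only feasible for very small $N$, since the number of unrestricted structures is $2^{N(N-1)}$.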

Figure 6.2(a) and Figure 6.2(b) show results for structure sets $\mathcal{T}_2$ and $\mathcal{A}_2$ respectively.
The results follow one’s intuition: few samples or $\rho < 0.1$ results in an edge posterior
close to chance, i.e. a uniform posterior over structure (1/2 for $\mathcal{T}_2$, 1/4 for $\mathcal{A}_2$). Once
$\rho > 0.1$ there is a sharp increase in the posterior on the edge being present.

Note that, here, we are quantifying uncertainty in terms of posterior edge appearance
probabilities. As discussed in Section 3.4.3 the posterior probability of edge appearance
is the same as the posterior mean of a multiplicative indicator function $f$ on that edge,

2 Throughout this chapter, when we use the term “weak” prior for the matrix-normal-inverse-Wishart
distribution we generally set the prior to have the following parameters for all conditional relationships
$v|S$: the matrix-normal is set to be zero mean, $\Omega = 0$, with $K$ set to the identity matrix. For the
inverse-Wishart we set the degrees of freedom $\nu$ to be the dimension of the data plus 2. $\Sigma$ is set either
to the identity for synthetic data or to twice the ML estimate of the covariance given all of the
data being analyzed. These priors are “weak” in the sense that they do not heavily bias the posterior
and the data term easily dominates the prior.
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(a)  T2  (b)  .42  =  V\ 

Figure  6.2.  The  Average  Edge  Posterior  for  Different  p  and  #  of  samples  T:  The  average  was  taken 
over  100  trials  for  each  setting  of  p  and  T.  (a)  shows  results  when  we  restrict  E  £  72  (b)  shows  results 
when  there  are  no  restrictions  placed  on  E. 

Ep(is|x>)  [f(E)]-  One  can  also  calculate  higher-order  information  such  as  the  posterior 
variance  using  IE  /  e\t>)  [ /( -E-)2] .  For  example,  running  a  single  trial  with  T  =  50,  p  =  0.1 
one  obtains  a  posterior  edge  probability  of  .6327  and  posterior  variance  of  .2324  when 
considering  Tj. 
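
The two numbers in this example are consistent with each other. As a short check, assuming f takes only the values 0 and 1 (so that $f^2 = f$) and writing m for the posterior edge probability (a symbol introduced here only for illustration):

$$
\mathrm{Var}[f(E)] = \mathbb{E}[f(E)^2] - \mathbb{E}[f(E)]^2 = m - m^2 = m(1 - m),
\qquad m = 0.6327 \;\Rightarrow\; 0.6327 \times 0.3673 \approx 0.2324 .
$$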

■  6.2  Experiments 

In this section we present a series of experiments on datasets involving the interaction of
multiple, N > 2, tracked moving objects. At first, we look at a simple static dependence
analysis task and then explore situations in which interactions are changing over time.
Specifically, we are interested in quantifying uncertainty in the dependence structure
describing interactions rather than obtaining point estimates. When analyzing data
we do not assume there is a “true or correct structure”; rather, our goal is to fully
characterize the uncertainty over the structure of interactions.

■  6.2.1  Static  Interactions 

We begin by considering a dataset comprised of recordings of three individuals playing
a simple interactive computer game. This “Interaction Game” is similar to that used
by Tieu [88]. Multiple players are each given a screen to look at which displays multiple
numbered markers. Each player’s computer mouse controls the position of one specific
marker on their screen while the other markers represent the positions of the other
players. A screenshot of the interface for this interaction game is shown in Figure
6.3. Three players are instructed to perform a particular interactive behavior. The
position of each player is recorded at 8 samples per second. Each behavior is recorded
for approximately 30 seconds, T ≈ 240.

Figure 6.3. A screenshot from our “Interaction Game”: Each player is represented by a round marker.
Here, we see the interface from the point of view of player 1.

The players are first instructed to follow each other in order. That is, player 1 is told
to move their marker around the screen while player 2 is instructed to follow player 1.
Player 3 is instructed to follow player 2. Using a uniform prior on structure and a weak
matrix-normal-inverse-Wishart prior on parameters, the posterior on structure is
calculated given this data using a TIM(1). Results for increasingly restrictive sets of
allowable structures are shown in Figure 6.4(a).

Visualization  of  the  uncertainty  over  interaction  structure  is  challenging  in  general. 
This  fact  will  motivate  the  use  of  more  focused  probabilistic  evaluations  as  the  data  sets 
we  consider  grow  in  complexity.  Here,  we  visualize  the  resulting  posterior  as  weighted 
interaction  graphs.  In  these  graphs,  edge  intensity  represents  the  posterior  probability 
of  that  edge  (darker  is  higher).  Node  intensity  represents  the  probability  a  time-series 
has  no  parents  and  is  a  root.  White  indicates  a  probability  of  0  while  black  indicates 
a probability of 1. Note that this type of following behavior is described well by all
sets of structures we consider. The posterior is the most peaked (exhibits the least
uncertainty) when considering the most restrictive set, directed trees $\mathcal{T}_3$.
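
The quantities drawn in a weighted interaction graph are marginals of the posterior over structures. The sketch below is a minimal illustration under stated assumptions: the posterior is given as an explicit table over single-parent structures, each encoded as a tuple whose entry v is the parent of time-series v (or `None` for a root), and both the function name `interaction_graph` and the example probabilities are hypothetical, not results from this chapter.

```python
def interaction_graph(posterior, n):
    """Return (edge, root) marginals from a posterior over structures.

    posterior maps a parent tuple to its probability; edge[u][v] is the
    posterior probability of edge u -> v and root[v] is the posterior
    probability that time-series v has no parent.
    """
    edge = [[0.0] * n for _ in range(n)]
    root = [0.0] * n
    for parents, prob in posterior.items():
        for child, parent in enumerate(parents):
            if parent is None:
                root[child] += prob
            else:
                edge[parent][child] += prob
    return edge, root


# Hypothetical posterior over three structures for three time-series
# (0-indexed): most mass on "series 1 drives series 0, series 2 independent".
example_posterior = {
    (1, None, None): 0.7,
    (None, None, None): 0.2,
    (1, 0, None): 0.1,
}
edge, root = interaction_graph(example_posterior, 3)
print("edge posteriors:", edge)
print("root posteriors:", root)
```

Edge darkness in the figures would correspond to entries of `edge` and node darkness to entries of `root`.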

In a different sequence players 2 and 3 are instructed to move freely, while player 1 is
told to follow player 2. The resulting posteriors are shown in Figure 6.4(b). While the
posteriors on structure agree with one’s intuition, there is some uncertainty as to whether
or not the edge 1 → 2 exists. Given the short sequence of data, this uncertainty may
be related to disambiguating “1 following 2” versus “2 evading 1”. Note this behavior
is not described well by a directed spanning tree since player 3 is independent of all
others.

Lastly, players 2 and 3 are told to move freely while player 1 does his best to stay
between both of them. The results in Figure 6.4(c) show that this behavior only seems
represented well in set $\mathcal{A}_3$; it is the only set that allows more than a single parent for
a time-series. Restricting the structures to have no more than one parent leads
the posteriors using sets $\mathcal{P}_3$ and $\mathcal{F}_3$ to strongly favor an independent explanation. This
independence is not allowed in the set of trees, $\mathcal{T}_3$, and thus the posterior exhibits the
most uncertainty.

■  6.2.2  Dynamic  Interactions 

Next we consider situations in which interaction changes over time. The first set of
experiments, again, involves data collected from our interaction game. However, this
time  the  players  are  told  to  switch  among  a  specific  set  of  behaviors.  The  second  set  of 
experiments  explores  data  obtained  from  a  recorded  basketball  game. 

Follow  the  Leader 

Using  the  same  interactive  game  setup,  we  record  three  individuals  playing  a  game  of 
follow  the  leader.  One  player  is  designated  the  leader.  The  leader  moves  his  or  her 
marker  randomly  around  the  screen  while  the  other  players  are  instructed  to  follow. 
Here,  the  designated  leader  changes  throughout  the  game.  That  is,  a  fourth  person 
observing  the  game  tells  the  players  when  to  switch  leaders.  In  this  case,  the  latent 
variable  indicating  the  change  of  leader  is  known,  and  consequently,  nominal  ground 
truth  is  available  by  which  to  evaluate  performance. 

[Figure 6.4 panels: (a) 3 follows 2 follows 1; (b) 1 follows 2; (c) 1 stays between 2 and 3]

Figure 6.4. Resulting Posterior Interaction Graphs for Specific Behaviors: The results for each specific
behavior enacted by three players using the “Interaction Game” are shown as weighted interaction
graphs. The darkness of a node in the graph is the probability of that time-series being a root. The darkness
of an edge is a function of the probability of that edge being present. Black represents probability 1
while white is probability 0. Each subfigure/row is a separate behavior. Each column shows results
when E is restricted to be a specific set. From left to right the columns represent increasingly restricted
sets starting with the set of all structures to the set of directed trees.

We begin by performing dynamic dependence analysis on this sequence using a
switching temporal interaction model STIM(1,3) with $E \in \mathcal{T}_3$. That is, we will use
the fact there are only three possible leaders and knowledge that directed trees may
sufficiently describe the interaction among players. A uniform prior on structures and
an equivalently weak prior on parameters are used. A weak prior on the state
transition distribution is imposed with a bias towards self-transition.

Figure 6.5. Samples of z Using a STIM(1,3) on the Follow the Leader Dataset: The very top plot
shows the ground truth with color indicating who was designated the leader. Leaders changed in order
from player 1 (blue) to player 3 (red). 100 state sequences drawn from our Gibbs sampler are displayed
below the ground truth. The state sequence samples are ranked according to their log likelihood with
the top state sequence representing the sample with the highest likelihood. The state sequence labels
were permuted such that they minimized the Hamming distance to the ground truth. This provides a
consistent coloring. The Hamming distance for each state sequence sample is shown on the right. The
Hamming distance is highly correlated with the log likelihood.

Given the data and the prior model, 100 samples of the structure, parameters and
the hidden state sequence are obtained with a Gibbs sampler as described in Section
4.4. The two primary steps of the sampler are:

1. Jointly sample a state sequence conditioned on the data and the current sample
of parameters and structure.

2. Jointly sample the structure and parameters using exact updates of the posterior
conditioned on the data and the sampled state sequence.
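
Section 4.4 is not reproduced here. As a rough illustration of the first step only, the sketch below uses forward filtering followed by backward sampling, a standard way to draw an entire HMM state sequence jointly; treating this as the procedure used in Section 4.4 is an assumption. The function name `sample_state_sequence` and its arguments (per-time log likelihoods under each state, a transition matrix, and an initial distribution) are hypothetical.

```python
import math
import random


def sample_state_sequence(log_lik, trans, init, rng):
    """Draw z_1..z_T jointly given per-state data log likelihoods.

    log_lik[t][k] : log p(x_t | z_t = k, current parameters and structure)
    trans[j][k]   : p(z_t = k | z_{t-1} = j)
    init[k]       : p(z_1 = k)
    """
    num_steps, num_states = len(log_lik), len(init)
    filtered = []
    prev = None
    # Forward pass: normalized filtering distributions p(z_t | x_1..x_t).
    for t in range(num_steps):
        shift = max(log_lik[t])  # subtract the max for numerical stability
        lik = [math.exp(value - shift) for value in log_lik[t]]
        if t == 0:
            pred = list(init)
        else:
            pred = [sum(prev[j] * trans[j][k] for j in range(num_states))
                    for k in range(num_states)]
        unnorm = [pred[k] * lik[k] for k in range(num_states)]
        total = sum(unnorm)
        prev = [value / total for value in unnorm]
        filtered.append(prev)
    # Backward pass: sample z_T, then each z_t given the sampled z_{t+1}.
    states = [0] * num_steps
    states[-1] = rng.choices(range(num_states), weights=filtered[-1])[0]
    for t in range(num_steps - 2, -1, -1):
        weights = [filtered[t][j] * trans[j][states[t + 1]]
                   for j in range(num_states)]
        states[t] = rng.choices(range(num_states), weights=weights)[0]
    return states


# Toy usage: three sticky states and mildly informative likelihoods.
sticky = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
toy_log_lik = [[0.0, -2.0, -2.0]] * 3 + [[-2.0, 0.0, -2.0]] * 3
print(sample_state_sequence(toy_log_lik, sticky, [1 / 3] * 3, random.Random(0)))
```

The second step would then update structure and parameters given the sampled sequence; that part depends on the conjugate updates of Chapter 3 and is not sketched here.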

Burn-in required approximately 60 iterations, with little change in the sampled state
sequence or data likelihood in subsequent iterations. A detailed view of the results can be
found in Figure 6.5. The ground truth state sequence is shown on top with players
taking turns being the leader in order. This figure shows each of the Gibbs sampled
state sequences. The labels were permuted to give a consistent coloring with the ground
truth segmentation shown on the top. The results are ranked by the log likelihood of the
data given the sampled parameters and structures, with the top being the most likely.
The side of the plot shows the normalized Hamming distance of the best mapping to
the ground truth [43]. Note that this error metric is highly correlated with the log
probability of the data. Each sample falls in one of two general categories. The top
third of the samples match the ground truth closely; the bottom two thirds suggest a
consistent alternative explanation.
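
Because state labels are arbitrary, a sampled sequence has to be relabeled before it can be compared with the ground truth. The following sketch shows one simple way to do this for a small number of states, by trying every one-to-one relabeling and keeping the one with the smallest normalized Hamming distance. The function name `best_permuted_hamming` is hypothetical, and exhaustive search over permutations is an assumption made for clarity; the procedure of [43] may differ.

```python
from itertools import permutations


def best_permuted_hamming(sampled, truth, num_states):
    """Return (distance, relabeled sequence) for the best label permutation."""
    best = None
    for perm in permutations(range(num_states)):
        relabeled = [perm[z] for z in sampled]
        mismatches = sum(a != b for a, b in zip(relabeled, truth))
        distance = mismatches / len(truth)
        if best is None or distance < best[0]:
            best = (distance, relabeled)
    return best


# Toy usage: the sampled labels are a pure relabeling of the truth.
distance, relabeled = best_permuted_hamming([2, 2, 0, 0, 1, 1],
                                            [0, 0, 1, 1, 2, 2], 3)
print(distance, relabeled)
```

With three states only six permutations are tried, so the exhaustive search is cheap.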

By  themselves,  the  state  labels  only  indicate  a  change  of  distribution  over  structure. 
While  this  segmentation  information  is  useful  and  interesting,  we  are  interested  in  the 
details  of  the  distribution.  Given  this  segmentation  we  look  at  the  posterior  distribution 
on  structure  to  analyze  the  interaction  among  the  players.  Figures  6.6  and  6.7  show  a 
more  detailed  breakdown  of  two  sampled  models.  The  first  row  of  each  figure  shows  the 
most  likely  segmentation  given  the  model.  The  second  row  shows  weighted  interaction 
graphs  representing  the  posterior  probability  of  the  structure  for  each  state.  Recall  that 
while  the  state  sequence  is  sampled  using  an  MCMC  method,  via  the  details  outlined 
in  Section  3.4.3,  we  can  obtain  an  exact  posterior  conditioned  on  this  sequence. 

Figure  6.6  is  a  sample  with  low  Hamming  distance  and  high  log  likelihood.  Notice 
that  the  posterior  distribution  on  structure  for  each  state  is  essentially  a  delta  function 
on  three  distinct  structures.  These  structures  agree  with  our  intuition  in  that  each  root 
is  consistent  with  who  was  designated  as  the  leader  and  the  followers  are  conditionally 
independent  given  the  root. 

Figure 6.7 is a sample with a mid-range Hamming distance (ranked 34th out of the 100
samples). It has errors consistent with the majority of sequences shown in Figure 6.5. The
confusion between the first and third state is most noticeably reflected in the structure
posterior for state 3. The above analysis assumed three states, consistent with our
knowledge of the ground truth, but Figure 6.7 gives evidence for additional states. That
is, for each phase of the game a better model may be a mixture of processes each with
similar structure but different parameter distributions. In order to test this hypothesis,
we repeat the experiment using K = 6 states. Figure 6.8 is a sample from this model.

The second row of Figure 6.8 shows the occurrence of the learned states. We see that the
ambiguity in Figure 6.7 is resolved by splitting the first and third state into two different
models, each with a posterior on structure consistent with the ground truth. Interestingly,
state 4 indicates uniform uncertainty in structure. However, this state is never
used and our prior is uniform, thus its posterior distribution remains uniform. This
suggests that perhaps 5 states were sufficient to capture the dynamic behavior of the
observations.

[Figure 6.6 plot title: Hamming Dist = 0.02, logProb = -1.225587e+03; horizontal axis ticks 50 to 550]

Figure 6.6. Highest Log Likelihood Sample for Follow the Leader Data: The sampled state sequence
with the highest log likelihood out of 100 samples drawn using a STIM(1,3) is shown at the top. The
tree graphs under the state sequence depict the posterior over structure for each state in terms of a
weighted interaction graph. The states are lined up with the sampled state sequence. Note that this
sample closely matches the ground truth and the posterior on structure for each state is peaked at a
structure which is indicative of who was leading.

[Figure 6.7 plot title: Hamming Dist = 0.22, logProb = -1.493586e+03; horizontal axis ticks 50 to 550]

Figure 6.7. Typical Sample for Follow the Leader Data: Here the sampled state sequence ranked
34th in terms of log likelihood out of 100 samples drawn using a STIM(1,3) is shown at the top. The
tree graphs under the state sequence depict the posterior over structure for each state in terms of a
weighted interaction graph. The states are lined up with the sampled state sequence. Note there is
some disagreement with the ground truth and confusion between states 1 and 3. This is reflected in some
posterior uncertainty in the structure for state 3.

[Figure 6.8 plot title: logProb = -1.084602e+03]

Figure 6.8. Sample Drawn Using a STIM(1,6) on the Follow the Leader Data: The top row displays
the sampled state sequence. The second row shows the amount of time each state was active in this
sampled sequence. The third row shows posterior interaction graphs for each state aligned with the
states in the middle row. Note that this sample explained player 1 being the leader with two states,
each with a strong posterior on time-series 1 being the root. States 5 and 6 both occur when player 3 is
the leader with some uncertainty in structure for state 5. State 4 is never used in the sample and thus
the posterior over structure is uniform for that state.

Basketball 

Next  we  consider  analysis  of  player  interactions  in  sports  data.  Automatic  analysis  of 
player  interactions  could  be  a  valuable  tool  for  understanding  and  learning  individual 
player  and  team  strategies.  That  is,  information  about  the  pattern  and  structure  of 
interactions  among  teammates  and  their  opposition  could  be  used  to  learn  a  playbook 
for  each  team.  Here,  we  focus  on  the  core  task  of  characterizing  player  interactions  and 
leave  the  lofty  goal  of  extracting  strategic  information  for  future  work. 

We explore a basketball game recording from the CVBASE 06 dataset [75]. Players
are tracked in two cameras and their positions are mapped to a common coordinate
system and recorded at each frame. Tracking is performed using a simple template
correlation tracker which is corrected by hand using an interactive tool. A total of 11
tracks are obtained; five players on each team plus the ball. A coarse annotation of the
current phase of the game is also created. Four phases are noted: team A on offense,
team B on offense, team A transitioning to offense, and team B transitioning to offense.
A sample frame with player positions is shown in Figure 6.9.

Figure 6.9. Sample Frame from Basketball Data: The top two images show the raw input from the
two cameras recording the game. The bottom image shows the un-warped common coordinate frame
with each player labeled.

A STIM(2,10) model with $E \in \mathcal{T}$ is used for analysis; a second order model
incorporates velocity information. Again, we use a weak prior on parameters and uniform
prior on structures. Note that here, with 11 time-series and $E \in \mathcal{T}$, we are reasoning
over $11^{10}$ structures for each state. However, as shown in Chapter 3, given a state, one
can reason over these structures with approximately $11^3$ operations. That is, we can
reason over 25 billion structures with on the order of 2 thousand computational steps.
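
Chapter 3 is not reproduced in this section. One standard route to summing over all directed spanning trees in cubic time is the directed Matrix-Tree theorem, and the sketch below uses it as an assumed stand-in for the method of Chapter 3. With all edge weights set to one, the sum is simply the number of directed trees, which gives a check on the $11^{10}$ figure. The function names are hypothetical, and for clarity this version takes one determinant per candidate root rather than a single determinant.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def sum_over_directed_trees(weight):
    """Sum, over all directed spanning trees, of the product of edge weights.

    weight[u][v] is the weight of edge u -> v; diagonal entries are ignored.
    """
    n = len(weight)
    # In-degree Laplacian: diagonal holds total incoming weight of each node.
    lap = [[0] * n for _ in range(n)]
    for v in range(n):
        for u in range(n):
            if u != v:
                lap[v][v] += weight[u][v]
                lap[u][v] -= weight[u][v]
    total = Fraction(0)
    for root in range(n):
        # Deleting the root's row and column sums over trees with that root.
        minor = [[lap[i][j] for j in range(n) if j != root]
                 for i in range(n) if i != root]
        total += determinant(minor)
    return total


ones = [[1] * 11 for _ in range(11)]
print(sum_over_directed_trees(ones))
```

In a Bayesian setting the unit weights would be replaced by per-edge evidence terms, so the same determinant yields a normalizing constant instead of a count.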

Figure 6.10. Sampled State Sequence for the Basketball Data: Sampled from a STIM(2,10) and compared
with the annotation describing which phase of the game is currently active. The top row shows the
annotation with each annotated state shown in a different color. The bottom row shows the sampled
state sequence z. The middle row shows the many-to-one mapping from the sampled sequence to the
annotation that minimizes Hamming distance.

Table 6.1. Basketball Results Showing Probability of Root: The probability of being a root during the
four annotated phases of the game is shown for each team and the ball. Bold and underlined values
are the maximum and second highest in each column respectively.

           A on offense   B on offense   B transitioning   A transitioning
                                         to offense        to offense
Team A     .2475          .0832          .0420             .4769
Team B     .0468          .2902          .4234             .0459
Ball       .7057          .6266          .5346             .4771

A sampled state sequence obtained from a Gibbs sampler is shown in the bottom
row of Figure 6.10. The middle row shows the best many-to-one mapping of the sampled
state sequence to the coarse annotation. That is, it uses a mapping which minimizes
the Hamming distance. Note that while the sampled sequence is somewhat predictive
of the annotation, a direct comparison is misleading as within each phase of the game
multiple structures may be active.
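
When several sampled states may map to the same annotated phase and no other constraint is imposed, the mapping that minimizes Hamming distance can be found state by state: each sampled state is assigned the annotation label it overlaps most often. The sketch below illustrates this under that assumption; the function name `many_to_one_mapping` and the toy sequences are hypothetical.

```python
from collections import Counter


def many_to_one_mapping(sampled, annotation):
    """Map each sampled state to its most frequent annotation label.

    Returns the mapping, the mapped sequence, and the normalized
    Hamming distance between the mapped sequence and the annotation.
    """
    votes = {}
    for state, label in zip(sampled, annotation):
        votes.setdefault(state, Counter())[label] += 1
    mapping = {state: counts.most_common(1)[0][0]
               for state, counts in votes.items()}
    mapped = [mapping[state] for state in sampled]
    mismatches = sum(m != a for m, a in zip(mapped, annotation))
    return mapping, mapped, mismatches / len(annotation)


# Toy usage: four sampled states collapse onto two annotated phases.
toy_sampled = [0, 0, 0, 1, 1, 2, 2, 2, 3]
toy_annotation = ["A", "A", "B", "A", "A", "B", "B", "B", "B"]
print(many_to_one_mapping(toy_sampled, toy_annotation))
```

Ties between labels are broken arbitrarily here, which does not affect the resulting distance.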

Given  a  sampled  state  sequence,  the  posterior  on  E  is  obtained  for  each  point  in 
time.  With  11  time-series  and  10  states,  displaying  all  posterior  interaction  graphs  is 
impractical.  As  an  alternative,  we  focus  on  calculating  posterior  event  probabilities 
over  time  intervals.  Table  6.1  shows  the  probability  of  the  root  being  on  either  team 
or  the  ball  given  each  phase  in  the  coarse  annotation.  Over  all  phases  of  the  game  the 
ball  has  the  highest  probability  of  being  the  root.  When  on  offense  or  transitioning  to 
offense  a  player  on  team  A  has  a  higher  probability  of  being  the  root  than  one  on  B 
and  vice  versa.  Similarly,  Table  6.2  shows  the  probability  of  being  a  leaf  averaged  over 
each  team  and  the  ball.  Here  the  ordering  is  reversed  and  has  a  connection  to  which 
team  is  on  defense.  We  see  how  analysis  of  these  posterior  event  probabilities  can  give 
one  a  statistical  prediction  of  the  state  of  the  game. 
Table 6.2. Basketball Results Showing the Average Probability of Being a Leaf: The average probability
that someone on Team A, Team B, or the ball is a leaf during the four annotated phases of the game is
shown. Bold values are the maximum in each column.

           A on offense   B on offense   B transitioning   A transitioning
                                         to offense        to offense
Team A     .3942          .4097          .4880             .3254
Team B     .5515          .0493          .4234             .6512
Ball       .1685          .0671          .1133             .2986


Lastly, we calculate the expected number of children for each player and the ball
over all time. As discussed in Section 3.4.3, the number of children can be expressed
as an additive function and the expectation is calculated using Equation 3.106. The
additive function f used to calculate the number of children of time-series v is set such
that $f_{S,u} = 1$ if $v \in S$ and 0 otherwise. The top four time-series, in terms of expected
number of children, are the ball, point guard A (PG A), forward 1 A (F1 A), and
forward 1 B (F1 B) with expectations of 1.76, 1.73, 1.08, and 0.95 respectively. Not surprisingly, the
ball has the largest influence on the dynamics of the game. The point guard generally
controls the flow of the game and is usually the best ball handler.
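
Equation 3.106 is not shown in this section. By linearity of expectation, the expected number of children of a time-series equals the sum of the posterior probabilities of its outgoing edges, and the sketch below illustrates that identity on a small enumerated example. It assumes a uniform posterior over the directed trees on three time-series purely for illustration; the helper names are hypothetical and the numbers are not related to the basketball results.

```python
from fractions import Fraction
from itertools import product


def directed_trees(n):
    """Yield parent tuples; parents[v] is v's parent, None for the root."""
    choices = [[None] + [u for u in range(n) if u != v] for v in range(n)]
    for parents in product(*choices):
        if sum(p is None for p in parents) != 1:
            continue  # a spanning tree has exactly one root
        acyclic = True
        for start in range(n):
            seen, node = set(), start
            while node is not None and node not in seen:
                seen.add(node)
                node = parents[node]
            if node is not None:  # the walk revisited a node: a cycle
                acyclic = False
                break
        if acyclic:
            yield parents


def expected_children(posterior, n):
    """Expected number of children per node from a posterior over structures."""
    edge = [[Fraction(0)] * n for _ in range(n)]
    for parents, prob in posterior.items():
        for child, parent in enumerate(parents):
            if parent is not None:
                edge[parent][child] += prob  # marginal of edge parent -> child
    # Linearity of expectation: sum the outgoing edge marginals.
    return [sum(edge[u]) for u in range(n)]


trees = list(directed_trees(3))
uniform = {tree: Fraction(1, len(trees)) for tree in trees}
print(expected_children(uniform, 3))
```

Under the uniform example every tree has two edges and the nodes are interchangeable, so each node has the same expected number of children.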

As a consequence of the framework developed in Chapters 3 and 4 we can pose more
complex probabilistic questions with regard to the posterior distribution over interaction
structures. For example, now that we know that point guard A has significant
influence we can look at his interactions in more detail. Figure 6.11 shows the posterior
probability of an edge from PG A to every other player over all time given a sampled
state sequence. Notice that the switching among states can be seen in terms of changes
in edge posteriors. PG A tends to influence his own forward as well as the point guard
and forward on the other team.

■  6.3  Summary 

In  this  chapter,  we  have  cast  the  problem  of  object  interaction  analysis  in  terms  of 
static and dynamic dependence analysis. Vector autoregressive TIM(r)s were used to
describe causal relationships among observed trajectories of objects and the priors
defined  in  Chapter  3  allowed  for  tractable  Bayesian  inference. 

We illustrated the utility of our framework via experiments characterizing posterior
structural uncertainty among the observed object trajectories. Static dependence analysis
was performed on the output of a simple multi-person interactive computer game
and results were obtained which were consistent with one’s intuition.

Figure 6.11. Influence of Point Guard A (PG A): Using a sampled state sequence each row represents
the amount of influence PG A has on a specific player as a function of time. Influence is the expected number
of edges leaving PG A’s time-series. Lighter colors mean higher probability of influence.

Using  a  STIM  we  were  able  to  analyze  data  in  which  interactions  changed  over 
time.  The  posterior  uncertainty  in  structure  revealed  the  potential  for  improvements 
to  our  model  by  hypothesizing  more  interactive  states.  Lastly,  we  analyzed  data  from 
a  real  basketball  game.  While  we  are  not  at  the  level  of  discovering  playbooks,  detailed 
structural  inference  using  our  model  yielded  posterior  statistics  predictive  of  who  was 
on  offense  and  defense.  In  addition,  further  analysis  identified  the  key  influential  players 
and  how  their  influence  changed  over  time. 


[Figure 6.11 plot: vertical axis p(edge from PG A to ...); horizontal axis t, ticks 100 to 900]



CHAPTER  6.  APPLICATION:  OBJECT  INTERACTION  ANALYSIS 


Chapter  7 


Conclusion 


In the preceding chapters we have developed a framework for analyzing changing
relationships among multiple time-series. We cast this dynamic dependence analysis task
in terms of inferring the underlying structure of probabilistic models describing
observed time-series. By doing so we were able to leverage a large body of existing work
on statistical modeling and structure inference. Motivated by two distinct problems
of audio-visual association and object interaction analysis, our primary focus was on
structural inference and description of uncertainty in inferred dependence rather than
learning and recognition. That is, our goal was to describe the dependence among
observed time-series in the absence of training data rather than building predictive models
for future recognition tasks.

■  7.1  Summary  of  Contributions 

We  introduced  two  static  dependence  models  for  time-series. 

A  FactM  allows  one  to  model  the  evolution  of  multiple  time-series  in  terms  of 
independent  groups.  A  TIM  allows  one  to  model  detailed  causal  relationships 
among  multiple  time-series. 

We  presented  a  decomposable  conjugate  prior  on  the  structure  and 
parameters  of  a  TIM. 

Taking advantage of temporal causality, this prior allows a super-exponential number
of structures to be reasoned over in exponential-time in general. Restricting
the class of structures such that each time-series has a bounded number of influences
allows a still super-exponential number of structures to be reasoned over
in polynomial-time. Conjugacy yields tractable calculation of exact joint and
marginal posterior probability of structure in addition to posterior expectations.
Furthermore, an exact characterization of uncertainty can be tractably obtained
with nuisance parameters integrated out.

We  introduced  a  DDM  for  reasoning  about  dependence  relationships 
changing  over  time. 

A  DDM  extends  the  static  dependence  models  presented  in  Chapter  3  to  allow  for 
changing  dependence  structure.  Building  on  the  large  body  of  work  on  dynamic 
Bayesian  networks  and  HMMs,  one  can  reason  over  an  exponential  number  of 
sequences  of  dependence  relationships  in  linear-time.  Inference  using  a  DDM  can 
exploit  past  and  future  data  when  making  a  local  decision  about  dependence.  We 
showed  both  theoretically  and  empirically  how  this  property  yields  advantages 
over  standard  windowed  analysis. 

We  demonstrated  state-of-the-art  performance  on  an  audio-visual  speaker 
association  task. 

This  performance  was  achieved  without  the  benefit  of  training  data  or  a  silence 
detector.  In  addition,  there  were  no  window  size  or  threshold  parameters  to  set. 
The  semantic  label  of  who  is  speaking  was  naturally  mapped  to  a  specific  structure 
of  association  among  the  observed  audio  and  video  time-series  using  a  FactM. 

We demonstrated the utility of dynamic dependence analysis to characterize
the interaction of multiple moving objects.

Object interaction analysis was formulated in terms of inference over TIM dependence
structure. A conjugate prior allows one to fully characterize posterior
structural uncertainty among observed object trajectories. Results obtained when
analyzing simple multi-person interactive computer games were consistent with
the behavior of the instructed participants. Analyzing the trajectories of basketball
players revealed key influential players and whom they influenced over time.
Additionally, posterior statistics on structure were predictive of the overall state
of the game.

■  7.2  Suggestions  for  Future  Research 


In this dissertation we discussed the general concept of a dynamic dependence analysis
task, presented two specific dependence models and explored their use in two distinct
application domains that best matched their strengths. We believe our models and
extensions thereof can be useful for a wide variety of other problems. Below we give an overview
of some possible extensions and open research directions for future work.

■  7.2.1  Alternative  Approaches  to  Inference 

In  Chapter  3  we  presented  two  approaches  to  inference.  For  the  FactM  we  used  a 
maximum  likelihood  approach,  while  for  the  TIM  we  presented  priors  which  allowed  for 
exact  Bayesian  inference.  The  choice  of  which  approach  to  use  was  based  primarily  on 
the end applications of interest. In the audio-visual speaker association task, our end goal
was  to  make  a  decision  about  who  was  speaking.  Thus,  point  estimates  obtained  from 
an  ML  approach  using  a  FactM  were  suitable.  However,  one  could  also  place  a  discrete 
prior  on  the  set  of  factorizations  of  interest,  along  with  a  conjugate  prior  on  parameters 
for  the  FactM,  and  calculate  the  MAP  structure.  Some  initial  experiments  using  MAP 
estimates  of  structure  on  the  CUAVE  dataset  yielded  comparable  performance  to  the 
ML  approach.  A  more  extensive  analysis  and  experiments  on  other  datasets  would  be 
beneficial  to  those  interested  in  the  audio-visual  association  problem. 

In  the  object  interaction  analysis  task  our  primary  goal  was  to  explore  a  large  set 
of  structures  and  give  a  full  characterization  of  dependence.  Thus,  using  a  Bayesian 
approach was well justified. However, an ML approach could also be applied to this
problem.  A  combination  of  using  exact  Bayesian  inference  over  structure  with  ML  point 
estimates  of  parameters  is  another  alternative  that  could  be  explored. 

■  7.2.2  Online  Dynamic  Dependence  Analysis 

We  defined  dynamic  dependence  analysis  as  a  batch  analysis  task.  This  allowed  one 
to  combine  past  and  future  information  to  make  a  local  decision  about  dependence 
relationships. The end applications we explored were in this batch context: providing
metadata to describe who was speaking in an archived video and post analysis of
interactions  among  tracked  objects. 

A  future  avenue  of  research  is  to  extend  our  approach  to  allow  for  online  inference  of 
dependence  relationships.  Since  our  DDM  is  simply  an  HMM  at  the  highest  level,  one 
can  take  advantage  of  the  large  body  of  existing  work  on  online  learning  and  inference  in 
dynamic Bayesian networks (cf. [67, 27]). As briefly mentioned in Chapter 4, our static
dependence models can also be embedded in alternative dynamic models such as
the Parts Partition model, for which efficient online Bayesian inference algorithms have
been developed [26, 28]. It is important to note that an online approach will still have
an advantage over windowed approaches in that it can incorporate past information
when performing inference.

■  7.2.3  Learning  Representation  and  Data  Association 

As mentioned in our introduction, a key challenge in fusing information from multiple
sources is that of representation. In this dissertation we assumed a sufficiently informative
representation was given to us and focused on identifying dependence among
sources. An interesting path for future research is to explore efficient ways to simultaneously
learn the representation and structure of dependence. We expect that knowledge
of which sources of data are dependent, and when, could help one learn more informative
representations.

Similarly,  we  assumed  each  observed  time-series  was  consistent  and  represented  a 
single  underlying  source  over  all  time.  That  is,  we  assumed  that  data  association 
was  solved.  Specifically,  we  assumed  that  a  tracker  provided  consistent  moving  object 
trajectories.  Obtaining  consistent  trajectories  is  a  difficult  task  for  tracking  systems, 
particularly  when  objects  are  close  together  and  may  occlude  one  another.  Thus,  one 
may  wish  to  jointly  track  and  reason  about  dependence.  Knowing  which  objects  are 
interacting  (and  how)  should  help  a  tracker  better  predict  future  positions  and  obtain 
more  consistent  tracks. 

■  7.2.4  Latent  Group  Dependence 


Figure  7.1.  Latent  Group  Dependence:  True  interaction  model  for  4  time-series  (left).  Grouping 
representation  (right)  where  there  are  3  latent  groups  a,b,  and  c.  Group  b  “clusters”  streams  2  and  4 
together.  Solid  arrows  represent  temporally  causal  influence  from  the  past  to  the  future.  Dotted  arrows 
represent  an  instantaneous  relationship. 


Our static dependence models allowed one to reason over the dependence among
a set of observed time-series. They did not consider the potential for unobserved factors
causing dependence among what was observed. Consider a parade with multiple
floats/vehicles following each other down a street. Each float has a group of many people
walking next to it. If we were given trajectories for the people and not the floats,
the inferred dependence relationships might be overly complex. That is, each person
would appear to be influenced by the others surrounding the same float and at the same
time be influenced by the people surrounding the float ahead. If the trajectories of the
floats were also observed, a less complex description may emerge. Each float can be
described as being influenced by the float ahead, and each person as being conditionally
independent given the float they are surrounding. Similarly, when analyzing the interaction
of many people in a crowd, it may be advantageous to model interactions among
groups of people rather than interactions among individuals.

Figure 7.1 depicts a scenario in which one observes 4 time-series. If these time-series
correspond to positions of moving objects, the left portion of the figure shows an
interaction graph that may emerge when object 3 is being followed by objects 2 and
4 moving together as a group, with object 1 following them. The right portion of the
figure shows an alternative view in which three latent groups or clusters are considered.

A latent group dependence model can be designed in which each of the N observed
time-series is assigned a label associating it with one of M groups. Each latent group
is represented as a time-series. The observed time-series are modeled as conditionally
independent given their group assignments. Given samples of the latent group time-series
one can reason about their dependence using one of the static dependence models
presented in this dissertation. Such a group dependence model would require a more
complex inference procedure. In a Bayesian inference task a Dirichlet prior could be
placed on group assignments and a Gibbs sampler used to generate samples from the
posterior. This latent group dependence model could be further extended to allow
group assignments, parameters and structure to change over time.

■  7.2.5  Locally  Changing  Structures 

The latent states used in our DDM index into a finite set of structures and parameters.
A change in latent state value over time indicated a global change in structure. If
one removed global constraints on dependence structure, an alternative model could
be designed in which each time-series v has its own latent state $z_t^v$. The latent state for
time-series v could specify which, if any, of the other time-series it depends on.
If there are N time-series, each with its own latent state taking on K values, such a
model could reason over $K^{NT}$ possible sequences of dependence structures. While the
DDM presented in this dissertation with $K^N$ states could also model as many sequences,
a locally switching model has the advantage of being able to pool information from times
which share the same local structure. Figure 7.2(a) shows a general graphical model
for allowing local dependence changes in a TIM. A specific example with N = 2 is
depicted in Figure 7.2(b), in which each latent state is a binary value indicating whether
a particular edge is present or not. Color is used to indicate which latent states are
controlling which edges.


Figure 7.2. Graph Models for the Proposed Local Dependence Switching: (a) The high level structure
of the model. (b) A specific example with N = 2 and K = 2 possible states. $z^1$ controls the
presence of (red) edges going into $x^1$ while $z^2$ controls (blue) edges going into $x^2$.


■  7.2.6  Long  Range  Temporal  Dependence 

The static dependence models presented in this dissertation assumed r-th order temporal
dependence. The current value of each time-series could only be directly dependent
on the values of other time-series within a window of r past time points. This restriction
was reasonable for the applications we looked at. However, consider using a TIM
to model a teacher-student relationship among moving objects. That is, imagine one
teacher object demonstrating a behavior for a student object. The student can observe
the teacher and then recreate the behavior. The student object is clearly dependent
on the teacher, but this dependence is delayed by some lag l. In order to capture this
relationship using a TIM, the temporal order r must be set to be greater than or equal
to l. If l is large, this will result in an overly complex model since a TIM assumes
dependence on all r past values. Thus, a useful modification of the TIM to explore in
the future may be one that can incorporate lags such that:

$$p(x_t \mid \tilde{x}_t, L, E, \Theta) = \prod_{v=1}^{N} p\left(x_t^v \,\middle|\, \tilde{x}_t^v, \tilde{x}_{t-l_v}^{pa(v)}, \Theta_{v|pa(v)}\right), \tag{7.1}$$

where $L = \{l_1, \ldots, l_N\}$ is a set of lags for each time-series. Each time-series is dependent
on its own r-th order past $\tilde{x}_t^v$ and the r-th order past of a parent set of time-series lagged
by $l_v$. A distribution $p(L)$ can be placed over these lags. Depending on the form of the
lag distribution, inference may still be tractable in a Bayesian setting. That is, if one
only considers a small set of possible discrete lags, the complexity of inference will only
increase linearly with the size of this set.

■  7.2.7  Final  Thoughts 

Modeling  and  understanding  evolving  dependence  structure  has  been  a  challenging  and 
intriguing  research  topic.  While  this  dissertation  has  focused  on  two  specific  appli¬ 
cations,  we  believe  the  ideas  and  models  presented  here  can  be  readily  adapted  and 
extended to a wide variety of other domains. Stepping back and reassessing one's own
research is often a humbling experience. It is easy to question the meaning or impact of
what many times turns into a very personal endeavor. I only hope the models and ideas
presented in this dissertation will be helpful to others in their own research pursuits.

Appendix A


Directed  Structures 


In  Section  3.4.2  we  discussed  a  prior  on  the  directed  structure  E  which  encodes  the 
dependence  relationships  in  a  temporal  interaction  model  (TIM).  The  prior  has  the 
following  form: 

$$p_0(E) = \frac{1}{Z(\beta)} \prod_{v=1}^{N} \beta_{pa(v),v} \tag{A.1}$$

In  Section  3.4.3  we  showed  that  this  prior  is  conjugate  and  thus  the  posterior  takes  the 
same  form.  In  this  Appendix  we  detail  methods  for  sampling  E  using  this  prior  and/or 
the  posterior.  In  addition,  we  discuss  how  to  sample  parameters  for  a  TIM  given  a 
structure  E,  how  to  obtain  the  MAP  structure,  and  discuss  potential  numerical  issues. 

■ A.1 Sampling Structure

We begin by outlining how to sample E given it has the distribution shown in Equation
A.1. First, we discuss the sampling procedure for when E has no global constraints (i.e.
$E \in \mathcal{A}_N$ or $E \in \mathcal{P}_N^K$). Next, we discuss how to sample E when it is restricted to be a
directed tree ($E \in \mathcal{T}_N$) or forest ($E \in \mathcal{F}_N$).

■ A.1.1 Sampling Without Global Constraints

If there are no global constraints on edges in the structure E, the parent set for each of
the N time-series can be sampled independently. By enumerating all valid parent sets
for each time-series one can sample a parent set using a discrete distribution. That is,
let $R_i^v$ be the i-th possible parent set for time-series v. The index i ranges from 1
to the total number of possible parent sets, $q_v$. A parent set can then be sampled
using a discrete distribution over $q_v$ possible values. For example, if N = 3 and $E \in \mathcal{A}_N$
(can be any directed structure), then $q_1 = q_2 = q_3 = 4$ and

$$\begin{aligned}
R_1^1 &= \{\}, & R_2^1 &= \{2\}, & R_3^1 &= \{3\}, & R_4^1 &= \{2, 3\}, \\
R_1^2 &= \{\}, & R_2^2 &= \{1\}, & R_3^2 &= \{3\}, & R_4^2 &= \{1, 3\}, \\
R_1^3 &= \{\}, & R_2^3 &= \{1\}, & R_3^3 &= \{2\}, & R_4^3 &= \{1, 2\}.
\end{aligned} \tag{A.2–A.7}$$

Algorithm A.1.1 outlines the sampling procedure. Note that the sampling procedure
requires one to enumerate all allowable parent sets for each $v \in \{1, \ldots, N\}$. When all
structures are allowed, $E \in \mathcal{A}_N$, each v has $2^{N-1}$ allowable parent sets that must be
indexed into. In practice, we store a bidirectional mapping between an N-bit number
representing a parent set and an index. The non-zero bits in the N-bit number indicate
which time-series are parents. When $E \in \mathcal{A}_N$ this map is straightforward and one can
use the N-bit number itself as the index. When the size of the parent set is restricted,
$E \in \mathcal{P}_N^K$, we form the map explicitly by recursively looping over all parent sets of size
0 to K.
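
The bit-based indexing described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this dissertation: the function names `enumerate_parent_sets` and `sample_structure`, the 0-based vertex numbering, and the callable `beta(mask, v)` that returns an unnormalized weight for a parent-set bitmask are all assumptions made for the example.

```python
import random


def enumerate_parent_sets(n, v, max_parents=None):
    """Return bitmasks of the allowable parent sets of vertex v (0-based).

    Bit u of a mask is set when time-series u is a parent of v.
    max_parents (hypothetical argument) bounds the parent-set size.
    """
    limit = n - 1 if max_parents is None else max_parents
    masks = []
    for mask in range(1 << n):
        if mask & (1 << v):
            continue  # a time-series is never its own parent
        if bin(mask).count("1") <= limit:
            masks.append(mask)
    return masks


def sample_structure(n, beta, max_parents=None, rng=random):
    """Sample each parent set independently; return a set of edges (u, v)."""
    edges = set()
    for v in range(n):
        masks = enumerate_parent_sets(n, v, max_parents)
        weights = [beta(mask, v) for mask in masks]  # unnormalized
        chosen = rng.choices(masks, weights=weights, k=1)[0]
        edges.update((u, v) for u in range(n) if chosen & (1 << u))
    return edges
```

With N = 3 and no size restriction, each vertex has $2^{N-1} = 4$ masks, matching the count given in the text. Restricting the size simply filters the same enumeration, which is one simple alternative to the recursive construction mentioned above.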


■ A.1.2 Sampling Directed Trees and Forests

When  E  is  restricted  to  be  a  directed  tree  or  forest  one  can  no  longer  treat  each  parent 
set  independently.  Fortunately,  sampling  directed  trees  is  a  well  studied  problem  and 
one  can  easily  transform  existing  procedures  for  trees  to  also  sample  directed  forests. 
Most  directed  tree  sampling  algorithms  can  be  categorized  as  either  determinant  or 
random  walk  based  approaches.  All  approaches  can  be  initialized  by  randomly  choosing 
a  root  based  on  Equation  3.97. 

Determinant based approaches such as [19] randomly choose an edge based on the
edge probabilities calculated using the Matrix Tree Theorem, subsequently contracting
chosen edges until a tree is formed. A straightforward implementation has a complexity
of $O(N^4)$, while improvements can be made to reduce the running time to $O(M(N))$,
where $M(N)$ is the complexity of multiplying two $N \times N$ matrices.

Random walk based algorithms simulate a random walk on a stochastic graph defined
by $\beta$, starting at a randomly sampled root. While walking on the graph one notes
each vertex that has been visited, erasing cycles as they are created until the walk
has generated a tree. The algorithm presented by Wilson [93] has an expected running time
proportional to the mean hitting time on the stochastic graph. The mean hitting time
is the expected time to go from any random vertex to any other given the steady state
distribution of the stochastic graph.

Algorithm A.1.1 Sampling E Without Global Constraints

Require: The full set of parameters β, and an enumerated set R of all possible parent sets
for each of the N vertices (time-series).
function SampleEBarWithoutGlobalConstraints(β, R)
    % Start with no edges
    E ← {}
    for v = 1 to N do
        % Define probabilities for each possible parent set
        for i = 1 to q_v do
            π_i ∝ β_{R^v_i, v}
        end for
        % Sample a parent set index
        k ~ Discrete(·; π_1, ..., π_{q_v})
        % Add the edges for the indexed parent set to v
        for each u ∈ R^v_k do
            Add edge (u, v) to E
        end for
    end for
    return E
end function

Both approaches have advantages and disadvantages. Wilson's algorithm is very
simple to implement and for many graphs the mean hitting time is much less than
$N^3$. However, it is possible to have a $\beta$ that yields graphs with exponentially large hitting
times. While determinant based algorithms have fixed complexity, they are more
complex and require careful implementation to ensure adequate precision [19]. We
discuss related numerical issues further in Section A.4.

In this dissertation, we use a simple modification of Wilson's RandomTreeWithRoot
procedure for sampling both directed trees and forests. Algorithm A.1.2 outlines
the sampling procedure given $\beta$ in matrix form. Note that in order to be consistent
with Wilson's original description in [93], Algorithm A.1.2 samples a directed tree with
all its edges reversed (i.e. the root is defined to have no children and leaves have no
parents). We convert between this edge reversed tree and our desired form with the
supporting functions shown in Algorithm A.1.3.

We modify Wilson's algorithm by setting a limit on the number of iterations.
This allows us to restart, try a different approach, or adjust our $\beta$ to approximate
our distribution with one which has a lower mean hitting time.

If we wish to sample a directed forest we can simply create a virtual root vertex
N + 1 and modify $\beta$ such that the weight from node N + 1 to any v is the root
weight for v. That is, set $\beta_{N+1,v} = \beta_{v,v}$ for all $v \in \{1, \ldots, N\}$. We can then call
RandomTreeWithRoot(N + 1, $\beta$, maxIter) to get a random forest.
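
The random walk sampler with an iteration limit, together with the virtual root construction for forests, can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the code used for the experiments: vertices are 0-based, `beta[u][v]` is taken to be the unnormalized weight of the edge from parent u to child v, the diagonal `beta[v][v]` is taken to hold the root weight of v, and the names `random_tree_with_root` and `random_forest` are hypothetical. The walk steps from a vertex to a candidate parent, which plays the role of the edge reversal described above.

```python
import random


def random_tree_with_root(root, beta, max_iter, rng=random):
    """Sample a directed tree rooted at `root` with a loop-erased random walk.

    Returns a set of (parent, child) edges, or None when a single walk
    exceeds max_iter steps so the caller can restart or fall back.
    """
    n = len(beta)
    in_tree = [False] * n
    nxt = [None] * n  # nxt[u] is the tentative parent of u
    in_tree[root] = True
    for start in range(n):
        u, steps = start, 0
        while not in_tree[u]:
            # Step against the edge direction, from u to a candidate parent.
            weights = [0.0 if p == u else beta[p][u] for p in range(n)]
            nxt[u] = rng.choices(range(n), weights=weights, k=1)[0]
            u = nxt[u]
            steps += 1
            if steps > max_iter:
                return None
        # Overwriting nxt during the walk erased cycles; retrace the path.
        u = start
        while not in_tree[u]:
            in_tree[u] = True
            u = nxt[u]
    return {(nxt[v], v) for v in range(n) if nxt[v] is not None}


def random_forest(beta, max_iter, rng=random):
    """Sample a directed forest by adding a virtual root with index n."""
    n = len(beta)
    aug = [[0.0 if u == v else beta[u][v] for v in range(n)] + [0.0]
           for u in range(n)]
    # The weight from the virtual root to v is the root weight of v.
    aug.append([beta[v][v] for v in range(n)] + [0.0])
    edges = random_tree_with_root(n, aug, max_iter, rng)
    if edges is None:
        return None
    return {(u, v) for (u, v) in edges if u != n}
```

Vertices whose only parent in the augmented tree is the virtual root become the roots of the sampled forest once that vertex is removed. Every non-root vertex needs at least one positive incoming weight, otherwise the walk has nowhere to step.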

■ A.2 Sampling Parameters

Given a sampled structure we can also sample parameters from the prior $p_0(\Theta \mid E)$ or
posterior $p(\Theta \mid E, \mathcal{D})$. As discussed in Section 3.4.2, our prior on parameters is conjugate
and both tasks involve sampling from the form:

$$p_0(\Theta \mid E) = \prod_{v=1}^{N} p_0\left(\Theta_{v|pa(v)} \,\middle|\, T\right) \tag{A.8}$$

where the hyperparameters T are updated as a function of the observations $\mathcal{D}$ when
obtaining the posterior. This modular and independent structure of the distribution
allows one to sample each parameter independently as shown in Algorithm A.2.1.

Algorithm A.1.2 The RandomTreeWithRoot Algorithm with Iteration Limit
Require: β in matrix form with element i, j being β_{i,j}
function RandomTreeWithRoot(r, β, maxIter)
    G ← MakeGBar(β)
    for i = 1 to N do
        InTree[i] ← false
    end for
    Next[r] ← nil
    InTree[r] ← true
    for i = 1 to N do
        Iter ← 0
        u ← i
        while not InTree[u] do
            Next[u] ← RandomSuccessor(u, G)
            u ← Next[u]
            Iter ← Iter + 1
            if Iter > maxIter then
                return failed
            end if
        end while
        u ← i
        while not InTree[u] do
            InTree[u] ← true
            u ← Next[u]
        end while
    end for
    E ← NextToEBar(Next)
    return E
end function

Algorithm A.1.3 The RandomTreeWithRoot Supporting Functions
function MakeGBar(β)
    % Transpose beta to reverse edge directions
    % This allows us to be consistent with Wilson's algorithm
    for v = 1 to N do
        β_{v,v} ← 0
        z ← Σ_{u=1}^{N} β_{u,v}
        for u = 1 to N do
            G_{v,u} ← β_{u,v} / z
        end for
    end for
    return G
end function

function RandomSuccessor(u, G)
    n ~ Discrete(·; G_{u,1}, ..., G_{u,N})
    return n
end function

function NextToEBar(Next)
    E ← {}
    for v = 1 to N do
        if Next[v] not nil then
            % Note again, we are reversing the role of Next
            Add edge (Next[v], v) to E
        end if
    end for
    return E
end function

Algorithm A.2.1 Procedure for Sampling Parameters Given Structure
Require: A sampled structure E and hyperparameters T
for v = 1 to N do
    Θ_{v|pa(v)} ~ p_0(Θ_{v|pa(v)} | T)
end for

■ A.3 Obtaining the MAP Structure

In some situations one may wish to obtain the MAP structure. That is, find

\begin{align}
E^* &= \arg\max_{E} \prod_{v=1}^{N} \beta_{pa(v,E),v} \tag{A.9} \\
    &= \arg\max_{E} \sum_{v=1}^{N} \log \beta_{pa(v,E),v} \tag{A.10}
\end{align}

When no global constraints are imposed on E the MAP structure can be found by identifying
the MAP parent set independently for each time-series. Thus, the procedure's
complexity is proportional to N times the size of the set of allowable parents for each
vertex. Algorithm A.3.1 outlines this procedure.
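
As a hedged illustration of this per-vertex maximization, the short Python sketch below works directly in the log domain, as in Equation A.10. The name `map_structure`, the representation of parent sets as frozensets of 0-based indices, and the callable `log_beta(parents, v)` are assumptions introduced for the example only.

```python
def map_structure(parent_sets, log_beta):
    """Pick the highest-scoring parent set independently for each vertex.

    parent_sets[v] lists the allowable parent sets of vertex v and
    log_beta(parents, v) returns the log weight of that choice.
    """
    edges = set()
    for v, candidates in enumerate(parent_sets):
        # One pass over the candidates of v: cost is the number of parent sets.
        best = max(candidates, key=lambda s: log_beta(s, v))
        edges.update((u, v) for u in best)
    return edges
```

Because the maximization only compares weights for a single vertex at a time, no normalization is needed.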

Algorithm A.3.1 Obtaining MAP E without Global Constraints
Require: The full set of parameters β, and an enumerated set R of all possible parent sets
for each of the N vertices (time-series).
function MAPEBarWithoutGlobalConstraints(β, R)
    E* ← {}
    for v = 1 to N do
        m ← −Inf
        z ← Σ_{i=1}^{q_v} β_{R^v_i, v}
        for i = 1 to q_v do
            π_i ← β_{R^v_i, v} / z
            if π_i > m then
                m ← π_i
                k ← i
            end if
        end for
        for each u ∈ R^v_k do
            Add edge (u, v) to E*
        end for
    end for
    return E*
end function

■ A.3.1 Directed Trees and Forests

Finding the MAP directed tree or directed forest is a well-studied problem. Finding the
MAP structure can be thought of as a maximum weight branching (forest) problem,
a minimum weight arborescence (directed tree) problem, or a minimum weight rooted
arborescence problem (cf. [57]). These problems are equivalent, and algorithmic
solutions were found independently by Chu and Liu [18], Edmonds [25] and Bock [10].

Here, for completeness, we provide the algorithm in the context of the maximum
weight rooted directed tree problem. Given a set of directed edges E, a designated root
r and a positive weighting function $w(u, v) \in \mathbb{R}^+$ for each edge $(u, v) \in E$, we wish to
find the subset $E^*$, forming a directed tree rooted at r, which maximizes the sum of
weights $\sum_{(u,v) \in E^*} w(u, v)$. The steps of the solution are outlined in Algorithm A.3.2.
We assume there exists at least one directed tree from root r.

At a high level there are four phases. First, one greedily selects edges. If the result
is a directed tree, the algorithm returns. If not, the second phase finds a cycle in the resulting
edge set and modifies the problem such that the cycle is represented by a new vertex,
and new edges and weights are added to and from this new cycle vertex. This step is
outlined in Algorithm A.3.3. The third phase takes this modified graph and recursively
calls the algorithm. The fourth phase takes the result from this recursive call and
cleans up by expanding the contracted cycle node and remapping edges using the bookkeeping
established in Algorithm A.3.3¹. A straightforward implementation has worst
case performance $O(N^3)$. Each greedy edge selection step is $O(N^2)$ and one can recurse
a maximum of N times. More efficient implementations such as [86] are $O(N^2)$.

We can map the problem of calculating the MAP directed tree ($E \in \mathcal{T}_N$) to the
maximum weight rooted directed tree problem. First, we use $\log p_0(E)$ and map
$w(u, v) = \log \beta_{u,v}$. Note that while the log is likely to produce negative weights, one can
always shift the weights by a constant so they are positive, as the solution is invariant
to constant shifts (or equivalently multiplying the posterior by a constant). Second,
we have to address the fact that we do not know the root. We can deal with this in
two ways. One way is to simply call ChuLiuEdmondsBock for all N roots r,
compare the results, and pick the resulting directed tree with the highest probability.

Another approach is to add a virtual root node $v_r$ and create an edge from $v_r$ to all

¹ We found the algorithmic description and example in [61] to be helpful when implementing these
functions. We only add them here for completeness and to be explicit about the extra bookkeeping
steps.

Algorithm A.3.2 The Chu-Liu, Edmonds, Bock Algorithm

Require: A set of allowable edges E on vertices V, a designated root r and a weighting
function w.
Ensure: The returned edge set E* forms a directed spanning tree at root r with maximum
weight Σ_{(u,v)∈E*} w(u, v)
function ChuLiuEdmondsBock(V, E, r, w)
    % Greedy selection
    E^g ← {}
    for each v ≠ r in V do
        E^g ← E^g ∪ argmax_{(u,v)∈E} w(u, v)    % Add best incoming edge
    end for
    if E^g has no cycles/loops then
        return E^g
    end if
    % Cycle contraction and graph re-weighting
    Find a cycle C in E^g
    {V_c, E_c, w_c, Ψ, Φ, v_c} ← Contract(C, V, E, w)    % A.3.3
    % Recursive call on modified graph
    E* ← ChuLiuEdmondsBock(V_c, E_c, r, w_c)
    % Clean up edges into C
    y ← pa(v_c, E*), z ← Φ(y, v_c), x ← pa(z, C)
    E* ← E* \ (y, v_c)
    E* ← E* ∪ (y, z)    % Add true incoming edge
    E* ← E* ∪ C         % Add back the cycle
    E* ← E* \ (x, z)    % Break the cycle
    % Clean up edges out of C
    for each y s.t. (v_c, y) ∈ E* do
        z ← Ψ(v_c, y)
        E* ← E* \ (v_c, y)
        E* ← E* ∪ (z, y)
    end for
    return E*
end function

Algorithm A.3.3 Contract Function used by Algorithm A.3.2

Require: A cycle C, a set of vertices V, allowable edges E and a weighting function w
Ensure: Return a modified graph {V_c, E_c} in which all vertices in C have been collapsed
into a single vertex v_c and edge weights are modified to be w_c. The edge weights are
calculated such that finding the maximum spanning tree in this modified graph can
be mapped back to the original problem using bookkeeping information Ψ and Φ.
function Contract(C, V, E, w)
    Let v_c be a new vertex representing C
    V_c ← V \ vertices(C) ∪ v_c
    E_c ← E \ C
    w_c ← w
    % Deal with edges into C
    T_C ← Σ_{(u,v)∈C} w(u, v)    % Total cost of cycle
    for each vertex y ∉ C s.t. there exists an edge (y, z) with z ∈ C do
        z ← argmax_{z'} w(y, z') + T_C − w(pa(z', C), z')
        w_c(y, v_c) ← w(y, z) + T_C − w(pa(z, C), z)
        E_c ← E_c ∪ (y, v_c)
        Φ(y, v_c) ← z    % Remember z is the true end of this edge
    end for
    % Deal with edges out of C
    for each vertex y ∉ C s.t. there exists an edge (z, y) with z ∈ C do
        z ← argmax_{z'} w(z', y)
        w_c(v_c, y) ← w(z, y)
        E_c ← E_c ∪ (v_c, y)
        Ψ(v_c, y) ← z    % Remember z is the true start of this edge
    end for
    return {V_c, E_c, w_c, Ψ, Φ, v_c}
end function


other vertices v with weights $w(v_r, v) = f(\log \beta_{v,v})$. The function $f(\cdot)$ must make the
weights such that the resulting maximum weight directed tree found for this modified
graph only has one edge from $v_r$, which points to the true root. This can be achieved
by using


$$f(a) = a - \ldots \sum_{v=1}^{N} \ldots \beta_{v} \ldots - \min_{(u,v)} w(u, v) \tag{A.11}$$


This function ensures that $w(v_r, v) = f(\log \beta_{v,v})$ is small enough such that even if one
picked all edges from $v_r$ the total weight would be less than that of picking the smallest edge
among the vertices other than $v_r$. It also keeps the same relative relationships among
the root weights. Calling ChuLiuEdmondsBock with $v_r$ designated as the root will
produce the MAP directed tree after removing $v_r$ from the result.

The problem of estimating the MAP directed forest ($E \in \mathcal{F}_N$) is also straightforward.
We again use the mapping $w(u, v) = \log \beta_{u,v}$ and create a virtual root node $v_r$.
However, this time the root weights $w(v_r, v)$ are simply $\log \beta_{v,v}$. Calling ChuLiuEdmondsBock
with $v_r$ designated as the root will produce the MAP directed forest after
removing $v_r$ from the result.


■ A.4 Notes on Numerical Precision

It is important to note that one must be careful with numerical precision when using the
temporal interaction model (TIM) presented in this dissertation. While performing and
storing results of calculations in the log domain is good practice, there are situations
in which issues with floating point accuracy cannot be avoided. For example, when
dealing with large datasets (large T) the evidence weights, W, calculated in Equation
3.89 become extremely small. In addition, the ratio of the largest evidence term to
the smallest can also become quite large. When reasoning over directed trees or forests
these extreme values cause ill-conditioned matrices $Q(\beta \circ W)$ (see Equation 3.68) and
make accurate calculation of the determinants of these matrices difficult. A determinant is
required when calculating the partition function and estimating event probabilities (see
Sections 3.4.2 and 3.4.3).

The datasets used in this dissertation were small enough that this issue did not
arise when dealing with the partition function. However, issues arose when sampling
directed trees or forests from a posterior. The evidence weights caused Wilson's RandomTreeWithRoot
algorithm to take a long time to complete or in some cases loop
indefinitely. The root cause of such behavior was that the stochastic graph formed using
these weights had extremely large mean hitting times. Our solution to this problem was
to set a maximum number of iterations for the algorithm². If the maximum number of
iterations was reached we resorted to an importance sampling technique using a proposal
distribution based on a modified set of evidence weights. The proposal distribution
used evidence weights which were rescaled so that they fit in the range [−K, 1] in the
log domain. Rescaling in the log domain is a nonlinear operation but maintains the relative
ordering of the evidence weights, effectively broadening the proposal distribution
over structure with respect to the true posterior. A similar approach was taken by [17]
for undirected trees. The authors of [17] also use a library (NTL) [80] which allows them to calculate
determinants to a desired precision (at the cost of speed).
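
The rescaling and the resulting importance weights can be sketched as follows. The exact rescaling used in this dissertation is not spelled out here, so the affine map below is an assumption chosen only because it preserves ordering. The names `rescale_log_weights` and `log_importance_weight`, and the dictionaries keyed by edge, are likewise hypothetical.

```python
def rescale_log_weights(log_w, lo, hi):
    """Affinely map log-domain weights into [lo, hi], preserving their order.

    Assumed form of the rescaling: an affine map in the log domain, which
    is a nonlinear (power) transform of the weights themselves.
    """
    w_min, w_max = min(log_w), max(log_w)
    if w_max == w_min:
        return [hi for _ in log_w]  # all weights equal: nothing to spread
    scale = (hi - lo) / (w_max - w_min)
    return [lo + (x - w_min) * scale for x in log_w]


def log_importance_weight(edges, log_w_true, log_w_prop):
    """Unnormalized log importance weight of a sampled structure.

    Both distributions are products of per-edge weights, so the log ratio
    of target to proposal is a sum of per-edge differences; the unknown
    normalizers cancel once the weights are self-normalized.
    """
    return sum(log_w_true[e] - log_w_prop[e] for e in edges)
```

A structure sampled from the broadened proposal would then be weighted by the exponential of `log_importance_weight`, normalized across all samples.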


² For the basketball data this was set to a number such that the sampler took no longer than 20
seconds to finish.

In this Appendix we provide details on and derivations for various equations presented
in this dissertation.


■ B.1 Derivation of Equations 3.38 and 3.40

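Recall that the log likelihood ratio has the following form:

$$l_{1,2} = \sum_t \log \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2)} \tag{B.1}$$

Taking the expectation under $H_1$ yields:

\begin{align}
E_{\mathcal{D}}[l_{1,2} \mid H_1]
  &= \sum_t \int p(\mathcal{D}_t, \tilde{\mathcal{D}}_t \mid F^1, \Theta^1) \log \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2)} \, d\mathcal{D}_t \, d\tilde{\mathcal{D}}_t \notag \\
  &= \sum_t \int p(\mathcal{D}_t, \tilde{\mathcal{D}}_t \mid F^1, \Theta^1) \log \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^\cap, \Theta^1)} \, d\mathcal{D}_t \, d\tilde{\mathcal{D}}_t \notag \\
  &\quad + \sum_t \int p(\mathcal{D}_t, \tilde{\mathcal{D}}_t \mid F^1, \Theta^1) \log \frac{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^\cap, \Theta^1)}{p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2)} \, d\mathcal{D}_t \, d\tilde{\mathcal{D}}_t \tag{B.5}
\end{align}

It breaks up into two terms in Equation B.5. The first term is well defined, but the
second is not. Let us look at this second term and simplify notation. It can also be
broken down into two sub-terms:

$$\int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log \frac{p_1(F^\cap \mid \tilde{F}^\cap)}{p_2(F^2 \mid \tilde{F}^2)}
  = \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log p_1(F^\cap \mid \tilde{F}^\cap)
  - \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log p_2(F^2 \mid \tilde{F}^2) \tag{B.6}$$

where we use the notation $p_B(F^A \mid \tilde{F}^A) = p(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^A, \Theta^B)$ and drop the $d\mathcal{D}_t$ and $d\tilde{\mathcal{D}}_t$
for space. Each sub-term can be looked at separately, starting with the first sub-term:

\begin{align}
\int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log p_1(F^\cap \mid \tilde{F}^\cap)
  &= \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log \prod_{k=1}^{|F^\cap|} p_1(F_k^\cap \mid \tilde{F}_k^\cap) \tag{B.7} \\
  &= \sum_{k=1}^{|F^\cap|} \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log p_1(F_k^\cap \mid \tilde{F}_k^\cap) \tag{B.8} \\
  &= \sum_{k=1}^{|F^\cap|} \int_{F_k^\cap, \tilde{F}_k^\cap} p_1(F_k^\cap, \tilde{F}_k^\cap) \log p_1(F_k^\cap \mid \tilde{F}_k^\cap) \tag{B.9} \\
  &= \sum_{k=1}^{|F^\cap|} \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^\cap, \tilde{F}^\cap) \log p_1(F_k^\cap \mid \tilde{F}_k^\cap) \tag{B.10} \\
  &= \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^\cap, \tilde{F}^\cap) \log p_1(F^\cap \mid \tilde{F}^\cap) \tag{B.11}
\end{align}

The progression from Equation B.8 to B.9 is due to the fact that all terms other than
$F_k^\cap$ get marginalized and that $F_k^\cap$ is a subset of one of the factors in $F^1$. Equation B.9
becomes B.10 by adding back terms which will be marginalized out, this time in terms
of the common factorization $F^\cap$.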



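The second sub-term is

$$
\begin{aligned}
-\int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log p_2(F^2 \mid \tilde{F}^2)
&= -\int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log \prod_{j=1}^{|F^2|} p_2(F_j^2 \mid \tilde{F}_j^2) && \text{(B.12)} \\
&= -\sum_{j=1}^{|F^2|} \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log p_2(F_j^2 \mid \tilde{F}_j^2) && \text{(B.13)} \\
&= -\sum_{j=1}^{|F^2|} \int_{F_j^2, \tilde{F}_j^2} \left[ \int_{\mathcal{D}_t \setminus F_j^2,\, \tilde{\mathcal{D}}_t \setminus \tilde{F}_j^2} p_1(F^1, \tilde{F}^1) \right] \log p_2(F_j^2 \mid \tilde{F}_j^2) && \text{(B.14)} \\
&= -\sum_{j=1}^{|F^2|} \int_{F_j^2, \tilde{F}_j^2} \left[ \prod_{i=1}^{|F^1|} p_1(F_i^1 \cap F_j^2, \tilde{F}_i^1 \cap \tilde{F}_j^2) \right] \log p_2(F_j^2 \mid \tilde{F}_j^2) && \text{(B.15)} \\
&= -\sum_{j=1}^{|F^2|} \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^{\cap}, \tilde{F}^{\cap}) \log p_2(F_j^2 \mid \tilde{F}_j^2) && \text{(B.16)} \\
&= -\int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^{\cap}, \tilde{F}^{\cap}) \log p_2(F^2 \mid \tilde{F}^2) && \text{(B.17)}
\end{aligned}
$$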


where Equation B.14 marginalizes $p_1(F^1, \tilde{F}^1)$ over all terms but $F_j^2$. This
marginalization yields a factorization which is formed by intersecting each factor in $F^1$
with $F_j^2$, as in Equation B.15. These intersection terms are by definition consistent with
$F^{\cap}$, and Equation B.16 simply reintroduces all terms of $p_1(F^{\cap}, \tilde{F}^{\cap})$ which were being
marginalized out.

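Plugging Equations B.11 and B.17 into Equation B.6 and expanding notation yields

$$
\begin{aligned}
& \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^1, \tilde{F}^1) \log \frac{p_1(F^{\cap} \mid \tilde{F}^{\cap})}{p_2(F^2 \mid \tilde{F}^2)} && \text{(B.18)} \\
&\quad = \int_{\mathcal{D}_t, \tilde{\mathcal{D}}_t} p_1(F^{\cap}, \tilde{F}^{\cap}) \log \frac{p_1(F^{\cap} \mid \tilde{F}^{\cap})}{p_2(F^2 \mid \tilde{F}^2)} \\
&\quad = \int p\left(\mathcal{D}_t, \tilde{\mathcal{D}}_t \mid F^{\cap}, \Theta^1\right) \log \frac{p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1\right)}{p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2\right)} \, d\mathcal{D}_t \, d\tilde{\mathcal{D}}_t \\
&\quad = D\left( p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1\right) \,\middle\|\, p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2\right) \right) && \text{(B.22)}
\end{aligned}
$$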


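Plugging this into Equation B.5 we get the final result:

$$
\begin{aligned}
E_{\mathcal{D}}\left[l_{1,2} \mid H_1\right]
&= \sum_t D\left( p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1\right) \,\middle\|\, p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1\right) \right) && \text{(B.23)} \\
&\quad + \sum_t D\left( p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1\right) \,\middle\|\, p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta^2\right) \right) && \text{(B.24)}
\end{aligned}
$$

The form of $E_{\mathcal{D}}\left[l_{1,2} \mid H_2\right]$ can be symmetrically derived.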
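The two-term decomposition in Equations B.23 and B.24 can be checked numerically in a small discrete setting with r = 0. The sketch below is only an illustration under assumptions that do not come from the dissertation: three binary variables, F^1 = {{1,2},{3}}, F^2 = {{1},{2,3}}, a common factorization F^∩ = {{1},{2},{3}}, and randomly drawn probability tables. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_table(shape):
    """Random strictly positive probability table (illustrative only)."""
    p = rng.random(shape) + 0.1
    return p / p.sum()

def kl(p, q):
    """KL divergence D(p || q) between two tables of the same shape."""
    return float(np.sum(p * np.log(p / q)))

# H1: factorization F^1 = {{1,2},{3}} -> p1(x1,x2,x3) = p(x1,x2) p(x3)
p12, p3 = random_table((2, 2)), random_table((2,))
p1 = p12[:, :, None] * p3[None, None, :]

# H2: factorization F^2 = {{1},{2,3}} -> p2(x1,x2,x3) = q(x1) q(x2,x3)
q1, q23 = random_table((2,)), random_table((2, 2))
p2 = q1[:, None, None] * q23[None, :, :]

# Common factorization F^cap = {{1},{2},{3}} with the parameters of H1,
# i.e. the product of the single-variable marginals of p1.
m1 = p1.sum(axis=(1, 2))
m2 = p1.sum(axis=(0, 2))
m3 = p1.sum(axis=(0, 1))
p_cap = m1[:, None, None] * m2[None, :, None] * m3[None, None, :]

# Left side: expected log likelihood ratio under H1 (one time step).
lhs = float(np.sum(p1 * np.log(p1 / p2)))
# Right side: the two KL terms of Equations B.23 and B.24.
rhs = kl(p1, p_cap) + kl(p_cap, p2)

print(lhs, rhs)
```

The two printed values should agree up to floating-point error, because the expectation of log p2 under p1 only depends on marginals that p1 shares with the common factorization.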


■  B.2  Consistent  ML  Estimates:  Equations  3.42  through  3.45 

The definition of a consistent estimator is that the estimate $\hat{\Theta}$ from data $\mathcal{D}_{1:T}$ asymptotically
converges to the true parameter $\Theta$ as the amount of data, $T$, grows without bound.
There are many forms of convergence. Here, we will define consistency in terms of weak
convergence in distribution such that as $T \to \infty$, $p(\hat{\Theta})$ will asymptotically approach
a delta function on $\Theta$. By this definition, if our ML estimate of $\Theta^1$ is consistent then,
given enough data under $H_1$, Equation 3.42 holds true. That is,

$$
p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \hat{\Theta}^1\right) \to p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^1, \Theta^1\right). \tag{B.25}
$$

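Next we look at what happens to $p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \hat{\Theta}^2\right)$ given data under $H_1$. For
simplicity, we first explore the case in which r = 0 and the quantity of interest becomes
$p\left(\mathcal{D}_t \mid F^2, \hat{\Theta}^2\right)$. This distribution, by definition of a FactM(0), is invariant to t. An ML
estimate of $\Theta^2$ given data from $H_1$ takes the form:

$$
\hat{\Theta}^2 = \arg\max_{\Theta} \frac{1}{T} \sum_{t=1}^{T} \log p\left(\mathcal{D}_t \mid F^2, \Theta\right) \tag{B.26}
$$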

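As $T \to \infty$, using the law of large numbers the sample average converges to the
expectation:

$$
\begin{aligned}
\hat{\Theta}^2
&= \arg\max_{\Theta} E_{\mathcal{D} \mid H_1}\left[ \log p\left(\mathcal{D}_t \mid F^2, \Theta\right) \right] && \text{(B.27)} \\
&= \arg\max_{\Theta} \int_{\mathcal{D}_t} p\left(\mathcal{D}_t \mid F^1, \Theta^1\right) \log p\left(\mathcal{D}_t \mid F^2, \Theta\right) d\mathcal{D}_t && \text{(B.28)} \\
&= \arg\max_{\Theta} \int_{\mathcal{D}_t} p\left(\mathcal{D}_t \mid F^1, \Theta^1\right) \log \frac{p\left(\mathcal{D}_t \mid F^2, \Theta\right)}{p\left(\mathcal{D}_t \mid F^{\cap}, \Theta^1\right)} d\mathcal{D}_t && \text{(B.29)} \\
&= \arg\min_{\Theta} \int_{\mathcal{D}_t} p\left(\mathcal{D}_t \mid F^1, \Theta^1\right) \log \frac{p\left(\mathcal{D}_t \mid F^{\cap}, \Theta^1\right)}{p\left(\mathcal{D}_t \mid F^2, \Theta\right)} d\mathcal{D}_t && \text{(B.30)} \\
&= \arg\min_{\Theta} D\left( p\left(\mathcal{D}_t \mid F^{\cap}, \Theta^1\right) \,\middle\|\, p\left(\mathcal{D}_t \mid F^2, \Theta\right) \right) && \text{(B.31)}
\end{aligned}
$$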
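where B.29 introduces a term which is constant with respect to $\Theta$ and the last line uses
our derivation of Equation B.22 above. Note that $p\left(\mathcal{D}_t \mid F^2, \Theta\right)$ is a more expressive
model than $p\left(\mathcal{D}_t \mid F^{\cap}, \Theta^1\right)$ since the factors in $F^{\cap}$ are subsets of the factors in $F^2$ by
definition. Thus, the minimum divergence is zero,

$$
\min_{\Theta} D\left( p\left(\mathcal{D}_t \mid F^{\cap}, \Theta^1\right) \,\middle\|\, p\left(\mathcal{D}_t \mid F^2, \Theta\right) \right) = 0 \tag{B.32}
$$

and

$$
p\left(\mathcal{D}_t \mid F^2, \hat{\Theta}^2\right) \to p\left(\mathcal{D}_t \mid F^{\cap}, \Theta^1\right) \tag{B.33}
$$

That is, the ML estimate of $\Theta^2$ is the one that produces the $p\left(\mathcal{D}_t \mid F^2, \hat{\Theta}^2\right)$ most consistent
with the true hypothesis under the common factorization.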
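The limit in Equation B.33 can be illustrated with a small simulation for the r = 0 case. This is a sketch under assumptions that are not taken from the dissertation: three binary variables, data drawn from F^1 = {{1,2},{3}} with hand-picked tables, and a fitted model with factorization F^2 = {{1},{2,3}} whose ML estimate is given by empirical marginals (true for unconstrained categorical tables). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# True model under H1: p1(x1,x2,x3) = p(x1,x2) p(x3), with x1 and x2 dependent.
p12 = np.array([[0.4, 0.1],
                [0.1, 0.4]])
p3 = np.array([0.7, 0.3])
p1 = p12[:, :, None] * p3[None, None, :]

# Draw T i.i.d. samples (r = 0) from p1.
T = 200_000
flat = rng.choice(8, size=T, p=p1.ravel())
x1, x2, x3 = np.unravel_index(flat, (2, 2, 2))

# ML fit of the F^2 = {{1},{2,3}} model: empirical marginals of each factor.
q1_hat = np.bincount(x1, minlength=2) / T
q23_hat = np.bincount(2 * x2 + x3, minlength=4).reshape(2, 2) / T
fit = q1_hat[:, None, None] * q23_hat[None, :, :]

# Common factorization F^cap = {{1},{2},{3}} with the true H1 parameters.
m1 = p1.sum(axis=(1, 2))
m2 = p1.sum(axis=(0, 2))
m3 = p1.sum(axis=(0, 1))
p_cap = m1[:, None, None] * m2[None, :, None] * m3[None, None, :]

# The fitted F^2 model should be close to p_cap, not to p1.
err_to_cap = float(np.abs(fit - p_cap).max())
err_to_p1 = float(np.abs(fit - p1).max())
print(err_to_cap, err_to_p1)
```

With this many samples the fitted model is close to the common-factorization distribution and noticeably far from the true joint, since the F^2 model cannot represent the dependence between the first two variables.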

The r = 0 case is used in all the examples and experiments presented in this dissertation.
Intuitively, a similar relationship should hold true for r > 0. However, for r > 0
one cannot use the law of large numbers to turn the sample average into an expectation
because one does not obtain independent samples over time. That is, for r > 0:

$$
\hat{\Theta}^2 = \arg\max_{\Theta} \frac{1}{T} \sum_{t=1}^{T} \log p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta\right) \tag{B.34}
$$

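where each $\mathcal{D}_t$ is no longer independent of all other times. If one could prove that as
$T \to \infty$,

$$
\frac{1}{T} \sum_{t=1}^{T} \log p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta\right) \to E_{\mathcal{D} \mid H_1}\left[ \log p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \Theta\right) \right] \tag{B.35}
$$

then one could follow the same form of analysis above to show:

$$
p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^2, \hat{\Theta}^2\right) \to p\left(\mathcal{D}_t \mid \tilde{\mathcal{D}}_t, F^{\cap}, \Theta^1\right) \tag{B.36}
$$

Equation B.35 holds true when our static dependence model is ergodic and stationary.
If stationary, all joint distributions are invariant to t. If ergodic, sample averages
over time converge to ensemble averages.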
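The role of stationarity and ergodicity in Equation B.35 can be illustrated with a toy process. The sketch below assumes a two-state first-order Markov chain (a stand-in for a model with r = 1, not one of the dissertation's models) with a hand-picked transition matrix, started from its stationary distribution. It compares the time average of the conditional log likelihood with the corresponding ensemble average.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical transition matrix: P[i, j] = p(x_t = j | x_{t-1} = i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
# Its stationary distribution (pi @ P == pi).
pi = np.array([0.75, 0.25])

# Simulate one long, stationary sample path.
T = 200_000
x = np.empty(T, dtype=int)
x[0] = rng.choice(2, p=pi)
u = rng.random(T)
for t in range(1, T):
    # Move to state 1 with probability P[previous state, 1].
    x[t] = int(u[t] < P[x[t - 1], 1])

# Time average of log p(x_t | x_{t-1}) along the single path.
log_P = np.log(P)
time_avg = float(log_P[x[:-1], x[1:]].mean())

# Ensemble average: expectation under the stationary joint of (x_{t-1}, x_t).
ensemble_avg = float(np.sum(pi[:, None] * P * log_P))

print(time_avg, ensemble_avg)
```

Because the samples are dependent over time, this agreement is not a consequence of the i.i.d. law of large numbers; it relies on the chain being stationary and ergodic.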
■ B.3 Derivation of Expectations of Additive Functions over Structure

Equation 3.105: For E ∈ A_N or E ∈ Ey the expectation over an additive function f takes
the form

Equation 3.106: If E ∈ T_N then
(B.37) through (B.47)
The last line uses the form for the probability of a root, and the expectation over an
additive function given a root follows the form shown in [62], substituting the directed
tree partition function in place of the undirected version.

■  B.4  Derivation  of  Equations  4.20  and  4.21 

The  likelihood  ratio  comparing  two  hypothesized  state  sequences  with  r  =  0  takes  the 
form: 


$l_{1,2} = \log \cdots$

(B.48) through (B.52)

where $d = \{t \mid a_t \neq b_t\}$ is the set of all time points at which the hypothesized state
sequences differ. Equations 4.20 and 4.21 are obtained by taking the expectation of
$l_{1,2}$ over data drawn from each hypothesized state sequence and using the same form
of derivation shown in Section B.1 with r = 0. That is, for each t we simply map $a_t$ to
$H_1$ and $b_t$ to $H_2$.
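The set d can be made concrete with a short sketch. It assumes, for illustration only, that with r = 0 the log likelihood ratio is a sum of per-time log ratios, each depending only on the state hypothesized at that time; under that assumption every time point with a_t = b_t contributes exactly zero. The per-state categorical distributions, the sequences, and all names below are hypothetical and are not the dissertation's models.

```python
import numpy as np

rng = np.random.default_rng(3)

K, M, T = 3, 4, 12  # number of states, observation symbols, time points

# Hypothetical per-state observation models: probs[k, m] = p(D_t = m | state k).
probs = rng.dirichlet(np.ones(M), size=K)

# Two hypothesized state sequences that differ at three time points.
a = rng.integers(0, K, size=T)
b = a.copy()
b[[2, 5, 9]] = (a[[2, 5, 9]] + 1) % K

# Arbitrary observed data (one symbol per time point).
data = rng.integers(0, M, size=T)

# Per-time log likelihood ratio terms.
per_t = np.log(probs[a, data]) - np.log(probs[b, data])

# d = {t | a_t != b_t}: the only time points that can contribute.
d = np.flatnonzero(a != b)

full_sum = float(per_t.sum())
restricted_sum = float(per_t[d].sum())
print(d, full_sum, restricted_sum)
```

Summing over all time points and summing over d alone give the same value, which is why only the time points in d need to be considered.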
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